Sentence collection tool development topic

DNGros · January 3, 2019, 12:07am

I’m excited to see this up and running! Common Voice is shaping into a really powerful dataset, but I’ve worried for a while whether there was enough sentence diversity in the dataset. Thanks for your work on this tool.

Below are listed some things where I had to second guess myself when reviewing or writing. These are areas where there might be need for better reviewer/submitter guidelines or candidates for a FAQ section for the guidelines.

Autovalidation specific issues

Should acronyms really be ruled invalid?
The current guidelines and autovalidator don’t allow abbreviations and acronyms because they can be pronounced many ways. However, in my opinion the fact that they can be pronounced many ways makes them valuable to a dataset, not harmful. An STT/ASR system must be able to handle all the many ways users pronounce words and have a rich enough language model to capture acronyms correctly. Pronounced acronyms like “FIFA” or “LIDAR”, as well as “spelled out” acronyms like “CO2”, “USB”, “FC Barcelona”, “FAQ” are common parts of language and potential user queries. I think they should be included in the dataset.
Longer max length
The current max length is set at 14 words. This is fairly short, and I found several of the sentences I wrote being rejected. I would say in general we shouldn’t disallow longer sentences, as it is important for practical STT/ASR systems to be able to handle longer sentences (the one you are finishing reading is 24 words). More data-driven analysis could be done on what the typical length of spoken sentences is, but I would vote 30 words is a sensible max.
Number words
Certain words include digits. It seems these should not be spelled out, and digits should not be auto-rejected.
Ex: “San Francisco 49ers”, “H2O”, “5G wireless network”.
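For the data-driven analysis of sentence length mentioned above, a minimal sketch could look like the following. It assumes you already have a list of transcribed spoken sentences; `word_count` and `length_summary` are hypothetical helper names, and counting whitespace-separated tokens is only a rough proxy for spoken words.

```python
import statistics


def word_count(sentence: str) -> int:
    """Count whitespace-separated tokens (a rough proxy for spoken words)."""
    return len(sentence.split())


def length_summary(sentences, limits=(14, 30)):
    """Summarize word counts and the share of sentences within each limit.

    The default limits are the two values discussed in this thread:
    the current max (14) and the proposed max (30).
    """
    counts = sorted(word_count(s) for s in sentences)
    summary = {
        "median": statistics.median(counts),
        "max": counts[-1],
    }
    for limit in limits:
        within = sum(1 for c in counts if c <= limit)
        summary[f"share_within_{limit}"] = within / len(counts)
    return summary
```

Running this over a corpus of real spoken sentences would show how much material each candidate limit excludes.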

Usability features

Inform user why sentences were autorejected (length, digits, abbreviation, etc).
Bug: Hitting “submit” on the add sentences form sometimes does nothing. For example, when I tried to just copy and paste the 438 sentences I wrote here https://pastebin.com/C0YpxtSJ it would not accept them and failed without any user feedback. Certain subsets of the paste are able to progress to the next screen. I’m not able to investigate further right now, but it is some bug. If I figure that issue out I can open an issue/PR on GitHub.
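To illustrate the first point (telling the user why a sentence was rejected), here is a rough sketch of a validator that returns reasons instead of a plain yes/no. This is not the tool’s actual code: `rejection_reasons` is a hypothetical name, and the three checks are only my guess at the length, digit, and abbreviation rules discussed in this thread.

```python
import re

MAX_WORDS = 14  # the limit quoted in this thread; assumed, not read from the tool


def rejection_reasons(sentence: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return human-readable reasons for rejection; an empty list means accepted."""
    reasons = []
    words = sentence.split()
    if not words:
        reasons.append("empty sentence")
    if len(words) > max_words:
        reasons.append(f"too long: {len(words)} words, limit is {max_words}")
    if re.search(r"\d", sentence):
        reasons.append("contains digits")
    # Assumption: a run of two or more capitals standing alone is a likely acronym.
    if re.search(r"\b[A-Z]{2,}\b", sentence):
        reasons.append("contains a possible abbreviation or acronym")
    return reasons
```

The form could then show these strings next to each rejected sentence so submitters know what to fix.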

Potential areas for improved guidelines

Is incorrect capitalization incorrect?
Ex: “I’ve heard that this hainanese chicken rice tastes amazing!”
“hainanese” should be capitalized.
However, this should not affect pronunciation, and I would vote that the guidelines rule this valid.
Should misplaced commas be invalid?
Ex: “If we’re in London, and trying to download a cat picture from a website, based in Australia, fetching the data results in a lot of hops along a lot of networks.”
The comma after “website” is unnecessary and implies an unnatural pronunciation.
Ex: “We also emit CO2 when we burn fossil fuels to keep places we work well-lit and depending where you are in the world warm or cool enough to work in.”
This really should have some commas to be “grammatically correct”. However, it is still pronounceable, and thus could be accepted if that is what the guidelines encourage.
Clarify that dates are numbers which should be spelled out.
Ex: “In 2017 we saw an increase in growth.”
This might not be clear from the current guidelines.
Hyphen in numbers or not? Or are both valid?
“nineteen eighty-eight” or “nineteen eighty eight”
Em dashes
Ex: “When we look at this map, it gives us an idea of how clean the emissions are likely to be, on average for a country - for example we can see that in France, with its heavy reliance on nuclear energy, generating a kilowatt hour of electricity than Poland, which is heavily invested in coal.”
This sentence would be a good challenge to a system and is grammatically valid (other than the fact that it uses a hyphen rather than an em dash). However, a panel of human transcribers given only the audio probably would not agree on the transcription, as a period and a dash sound similar. So it seems like this should be invalid. Agreed?
Should foreign characters common in a language be allowed?
For example: letters with diacritics sometimes appear in English when using words (especially foreign proper nouns) like “Nestlé”, “L’Oréal”, “exposé”, “José”, and “Sofía”. Should these be allowed? Currently they would get autorejected.
colloquial grammar / “as they said it spelling”
For example: “I ain’t gonna put up with this”, “why ya’ gotta be like that”, “That’s one mean lookin’ motorcycle man.”
Is this allowed?
This could be valuable for a system to recognize, as it might better reflect how some real users talk, and a system should not fail when the user says “gonna” instead of “going to”. It isn’t “grammatically correct”, but this kind of writing is sometimes used in fiction to communicate a character’s personality more.
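On the diacritics question, here is a small sketch of how a checker could tell letters with diacritics apart from other unexpected characters, using only Python’s standard `unicodedata` module. The function names are hypothetical, and nothing here reflects how the current autovalidator works.

```python
import unicodedata


def non_ascii_letters(sentence: str) -> list[str]:
    """List letters outside ASCII, e.g. the accented letters in foreign names."""
    return [ch for ch in sentence if ch.isalpha() and not ch.isascii()]


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

If `strip_diacritics` turns a word into plain A-Z letters, the original is probably a Latin-script word with accents rather than a stray symbol, so it could be sent to human review instead of being autorejected.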

All of these might not need specific guidelines, and for some we could just let the reviewers vote. However, I’ll add them here in case others also had the same questions and we want to try and standardize things more.

davidak · January 3, 2019, 7:59am

A general thing I noticed:

Why is this tool not part of Common Voice, and why doesn’t it follow its design?

More rules might be needed. For example, don’t add offensive sentences that violate the Code of Conduct. And a hint about what good sentences are, for example general facts.

Also a hint where public domain resources can be found.

nukeador · January 3, 2019, 11:52am

This is just a quick tool to meet the needs of a small group of people. To have it available sooner, we decided not to create a super polished version but rather a functional version that can be improved over time. Ideally we want an integrated sentence collector in the main portal during 2019.

Just filed:

nukeador · January 3, 2019, 11:30am

Thanks for the feedback. We have just followed the guidelines provided by the Deep Speech team in order to make sure the resulting sentences+voices are useful for the algorithms. @josh_meyer might be able to provide you a bit more detail on the reasoning behind them.

Cheers.

nukeador · January 3, 2019, 11:46am

These two issues are tracking that:

dabinat · January 3, 2019, 4:14pm

Here are my first impressions:

The registration process is confusing - you have to first log in with your desired username and password in order to register. I think there should be an actual registration page, even if it functions largely the same as the login page, because it’s more logical and mirrors how other sites work.
The review process was also confusing to me. You tap Yes and it turns green. Does that mean it’s approved? But it’s still there when I return to the page. Oh, I have to physically click the Submit button, which was right by the page number box, so I thought it was for changing the page. Having it auto-submit, hide the sentence and show the next sentence would be a lot better.
What is the purpose of the Skip button? I can choose to ignore sentences without consequence and I can navigate to any page I like, so what advantage does skipping provide?
It seems pretty easy to add invalid sentences. I entered a period for the sentence and the source and it told me the sentence already exists (implying it would have allowed it if it didn’t). I then changed it to a comma in the sentence box and it went through and ended up as a blank sentence.

Other suggestions:

I don’t think users should be allowed to review their own sentences.
Should it reject if the first letter of a sentence doesn’t begin with a capital letter? That might help to catch copy and paste errors where there was an errant newline in the middle of the sentence.

nukeador · January 3, 2019, 5:23pm

We decided to stick with the current login system to avoid adding additional work for this version. In the next deployment we have added some explanation about how it works on the same page.

Good idea. Currently we are going to stick with this workflow (click yes on everything you want to approve and then finally submit), but we can see what other improvements we can make in the future. I’ve opened this issue to track it.

The skip button is broken right now; we plan to remove it in the next deployment.

Can you describe this in more detail in a new issue? I don’t know if I fully understand what you got.

This is intentional since we know a lot of people will be adding a lot of sentences from different sources other than their own creation.

Not sure, this might limit the ability for people to take long paragraphs and split them into smaller sentences that can still be read and make sense.

mkohler · January 3, 2019, 6:55pm

oh, I’ve filed https://github.com/Common-Voice/sentence-collector/issues/58 for this…

This might work in English, but then we would need to define this per language, as probably not all languages use uppercase?
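One way to sidestep per-language definitions is to reject only when the first letter is explicitly lowercase, so scripts without letter case are never penalized. A rough sketch, where `starts_like_a_sentence` is a hypothetical helper and not something the tool currently has:

```python
def starts_like_a_sentence(sentence: str) -> bool:
    """Return False only if the first letter is lowercase or there is no letter.

    Leading quotes, digits, and whitespace are skipped. Letters from scripts
    without case are not lowercase, so they pass.
    """
    first_letter = next((ch for ch in sentence if ch.isalpha()), None)
    if first_letter is None:
        # No letters at all, e.g. a lone period or comma.
        return False
    return not first_letter.islower()
```

This would also catch the punctuation-only submissions mentioned earlier in the thread, since those contain no letters. Some languages may legitimately start sentences with lowercase letters, so this would still need a per-language opt-out.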

I will fix a few more bugs now and then deploy again.

nukeador · January 4, 2019, 10:56am

A post was split to a new topic: Problems finding public domain sentences

mkohler · January 3, 2019, 8:09pm

I’ve deployed the latest version with many bugfixes, however the skip button will still be there. Will need to think about that removal a bit more. It should however not mark the sentence as approved/rejected anymore and do nothing. This is tracked in https://github.com/Common-Voice/sentence-collector/issues/44

mkohler · January 6, 2019, 10:51pm

I’ve deployed a lot more fixes for both bugs and UX topics. It would be great if all of you could keep testing to make sure I didn’t introduce new bugs.

nukeador · January 8, 2019, 12:19pm

Thanks @txopi @DNGros @davidak @dabinat @jef.daniels and everyone who is helping with the QA phase, we have been able to fix a lot of issues and make the tool better thanks to your help.

Let’s do a final push in the next couple of days to make sure we are ready to move the tool to beta phase, let’s keep testing and reporting issues.

In the beta phase we will clean up the database and invite everyone in the Common Voice community who has been asking to get their sentences included to start using it as the main channel for sentence submission and review.

Sentences added and reviewed in the beta phase will start being incorporated in the main Common Voice site. We expect some languages to reach the 5000 sentences, which will allow them to enable voice collection.

nukeador · January 9, 2019, 11:11am

5 posts were split to a new topic: Feedback on how we collect and validate sentences

mkohler · January 8, 2019, 8:22pm

I have deployed a new version with several fixes. See the “CHANGELOG” column in https://github.com/Common-Voice/sentence-collector/projects/1 . Thanks everyone for your feedback and for reporting bugs, this is getting better and better.

DNGros · January 8, 2019, 10:42pm

Ok, great. Sorry for not checking for existing issues before commenting.

DNGros · January 8, 2019, 11:06pm

It would be great to hear from @josh_meyer or others involved with that team especially on the issue around acronyms as well as around how the 14 word limit was chosen. Specifically on acronyms I worry whether systems trained on Common Voice will be able to handle the acronyms that occur all over everyday speech, unless we allow them in the dataset.

Another area that still seems to need more clarity is whether non-A-Z characters should be allowed. I was going through and reviewing some more sentences that are on the site now, and was not sure whether “Lucas played in the São Paulo soccer team” should be rejected or accepted.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · January 9, 2019, 6:46pm

I sent Portuguese sentences to the English one by accident. I’m not sure if I picked the wrong option in the combobox or the site didn’t recognize Portuguese and then sent those sentences to the English dataset. How do I remove those sentences?

nukeador · January 9, 2019, 9:16pm

Currently there is no way to remove sentences, but don’t worry, we’ll clean the database before moving to the beta phase.

mkohler · January 13, 2019, 8:08pm

I’ve just deployed the latest changes to the website. You can find the changelog here: https://github.com/Common-Voice/sentence-collector/releases

jumasheff · January 14, 2019, 10:13am

Hey, do you have an endpoint where I can post all the scraped data I have? The texts are scraped from a news website (ky.kloop.asia), a founder of which has generously shared all of its contents under CC0 (via a Facebook chat conversation; a screenshot can be shared upon request). It’s boring to copy-paste all the news articles from 2011 to 2018, you know.
Or is it easier for you to accept all text data by email or other means? Perhaps a github repo with all the texts will work for you?
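Whatever the channel ends up being, the scraped articles would probably need to be turned into one sentence per line first. A minimal sketch of that preparation step, assuming sentences end in ".", "!" or "?" followed by whitespace (a naive rule that will mis-split on abbreviations); `prepare_sentences` is a hypothetical helper:

```python
import re


def prepare_sentences(raw_text: str) -> list[str]:
    """Split text into sentences, normalize whitespace, and drop duplicates."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", raw_text)
    seen = set()
    result = []
    for piece in pieces:
        sentence = " ".join(piece.split())
        if sentence and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result
```

The output could be written to a text file with `"\n".join(...)` and shared through a repo or pasted in batches.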