Quick update: We still have some pending issues for the MVP. Ideally we would have a beta version to test before the end of the year; if the QA of that beta is satisfactory, we can start using it to collect and review sentences.
I’m excited to see this up and running! Common Voice is shaping into a really powerful dataset, but I’ve worried for a while whether there was enough sentence diversity in the dataset. Thanks for your work on this tool.
Below are some things where I had to second-guess myself when reviewing or writing. These are areas where there might be a need for better reviewer/submitter guidelines, or candidates for a FAQ section in the guidelines.
Autovalidation specific issues
Should acronyms really be ruled invalid?
The current guidelines and autovalidator don’t allow abbreviations and acronyms because they can be pronounced many ways. However, in my opinion the fact that they can be pronounced in several ways makes them valuable to a dataset, not harmful. An STT/ASR system must be able to handle the many ways users pronounce words and have a rich enough language model to capture acronyms correctly. Pronounced acronyms like “FIFA” or “LIDAR”, as well as “spelled out” acronyms like “CO2”, “USB”, “FC Barcelona”, “FAQ” are common parts of language and potential user queries. I think they should be included in the dataset.
Longer max length
The current max length is set at 14 words. This is fairly short, and I found several of the sentences I wrote being rejected. I would say in general we shouldn’t disallow longer sentences, as it is important for practical STT/ASR systems to be able to handle longer sentences (the one you are finishing reading is 24 words). More data-driven analysis could be done on what a typical length is for spoken sentences, but I would vote 30 words is a sensible max.
Certain words include digits. It seems these should not be spelled out, and digits should not be auto-rejected.
Ex: “San Francisco 49ers”, “H2O”, “5G wireless network”.
Inform user why sentences were autorejected (length, digits, abbreviation, etc).
Bug: Hitting “submit” on the add sentences form sometimes does nothing. For example, when I tried to just copy and paste the 438 sentences I wrote here https://pastebin.com/C0YpxtSJ it would not accept them and failed without any user feedback. Certain subsets of the paste are able to progress to the next screen. I’m not able to investigate further right now what is happening, but it is some bug. If I figure that issue out I can open an issue/PR on GitHub.
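As a rough sketch of the "inform the user why" idea above: a validator could return a list of reasons instead of a bare accept/reject. This is not the sentence collector's actual code. The function name, the regular expressions, and the acronym heuristic (two or more consecutive capitals) are assumptions made for illustration; only the 14-word limit comes from the current behavior described above.

```python
import re

MAX_WORDS = 14  # current limit mentioned above; 30 is the proposed alternative

# Hypothetical heuristic: two or more consecutive capitals, e.g. "USB" or "FIFA".
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")
DIGIT_RE = re.compile(r"\d")


def rejection_reasons(sentence, max_words=MAX_WORDS):
    """Return human-readable rejection reasons; an empty list means accepted."""
    reasons = []
    word_count = len(sentence.split())
    if word_count == 0:
        reasons.append("sentence is empty")
    if word_count > max_words:
        reasons.append(f"too long: {word_count} words (max {max_words})")
    if DIGIT_RE.search(sentence):
        reasons.append("contains digits")
    if ACRONYM_RE.search(sentence):
        reasons.append("contains an abbreviation or acronym")
    return reasons


if __name__ == "__main__":
    for text in ["The cat sat on the mat.", "5G wireless network", "Plug in the USB cable."]:
        print(repr(text), "->", rejection_reasons(text) or "accepted")
```

Returning every reason at once, rather than stopping at the first failure, lets the submitter fix a sentence in one pass. Relaxing a rule (for example allowing acronyms) would then just mean dropping one check.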
Potential areas for improved guidelines
Is incorrect capitalization incorrect?
Ex: “I’ve heard that this hainanese chicken rice tastes amazing!”
“hainanese” should be capitalized.
However, this should not affect pronunciation and I would vote that the guidelines rule this valid.
Should misplaced commas be invalid?
Ex: “If we’re in London, and trying to download a cat picture from a website, based in Australia, fetching the data results in a lot of hops along a lot of networks.”
The comma after “website” is unnecessary and implies an unnatural pronunciation.
Ex: “We also emit CO2 when we burn fossil fuels to keep places we work well-lit and depending where you are in the world warm or cool enough to work in.”
This really should have some commas to be “grammatically correct”. However, it is still pronounceable, and thus could be accepted if that is what the guidelines encourage.
Clarify that dates are numbers which should be spelled out.
Ex: “In 2017 we saw an increase in growth.”
This might not be clear from the current guidelines.
Hyphen in numbers or not? Or are both valid?
“nineteen eighty-eight” or “nineteen eighty eight”
Ex: “When we look at this map, it gives us an idea of how clean the emissions are likely to be, on average for a country - for example we can see that in France, with its heavy reliance on nuclear energy, generating a kilowatt hour of electricity than Poland, which is heavily invested in coal.”
This sentence would be a good challenge for a system and is grammatically valid (other than the fact that it uses a hyphen rather than an em dash). However, a panel of human transcribers given only the audio probably would not agree on the transcription, as a period and a dash sound similar. So it seems like this should be invalid. Agreed?
Should foreign characters common in a language be allowed?
For example: letters with diacritics sometimes appear in English when using words (especially foreign proper nouns) like “Nestlé”, “L’Oréal”, “exposé”, “José”, and “Sofía”. Should these be allowed? Currently they would get autorejected.
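If these were to be allowed, one possible approach is a Unicode-aware character check instead of an ASCII-only one. The sketch below is only an illustration under stated assumptions: the function name and the punctuation whitelist are hypothetical, and it does not describe how the current autovalidator works. It accepts any Latin-script letter, including accented ones, by looking at the character's Unicode category and name.

```python
import unicodedata

# Hypothetical whitelist of non-letter characters; adjust per language.
ALLOWED_PUNCTUATION = set(" .,?!'’\"“”-")


def has_only_allowed_characters(sentence):
    """Accept Latin-script letters (with or without diacritics) and basic punctuation."""
    # Normalize so that "e" + combining accent becomes the single character "é".
    normalized = unicodedata.normalize("NFC", sentence)
    for ch in normalized:
        if ch in ALLOWED_PUNCTUATION:
            continue
        is_letter = unicodedata.category(ch).startswith("L")
        is_latin = "LATIN" in unicodedata.name(ch, "")
        if is_letter and is_latin:
            continue
        return False
    return True


if __name__ == "__main__":
    for text in ["Nestlé and L’Oréal", "José met Sofía.", "5G network"]:
        print(repr(text), "->", has_only_allowed_characters(text))
```

Digits are still rejected here, so this check is independent of the question about words such as “H2O”.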
Colloquial grammar / “as they said it” spelling
For example: “I ain’t gonna put up with this”, “why ya’ gotta be like that”, “That’s one mean lookin’ motorcycle man.”
Is this allowed?
This could be valuable for a system to recognize, as it might better reflect how some real users talk, and a system should not fail when the user says “gonna” instead of “going to”. It isn’t “grammatically correct”, but this kind of writing is sometimes used in fiction to better communicate a character’s personality.
All of these might not need specific guidelines, and for some we could just let the reviewers vote. However, I’ll add them here in case others also had the same questions and we want to try and standardize things more.
This is just a quick tool to meet the needs of a small group of people. In order to have it available sooner, we decided not to create a super polished version but rather a functional version that can be improved over time. Ideally we want an integrated sentence collector in the main portal during 2019.
Thanks for the feedback. We have just followed the guidelines provided by the Deep Speech team in order to make sure the resulting sentences+voices are useful for the algorithms. @josh_meyer might be able to provide you a bit more detail on the reasoning behind them.
The registration process is confusing: you first have to log in with your desired username and password in order to register. I think there should be an actual registration page, even if it functions largely the same as the login page, because it’s more logical and mirrors how other sites work.
The review process was also confusing to me. You tap Yes and it turns green. Does that mean it’s approved? But it’s still there when I return to the page. Oh, I have to physically click the Submit button, which was right by the page number box, so I thought it was for changing the page. Having it auto-submit, hide the sentence and show the next sentence would be a lot better.
What is the purpose of the Skip button? I can choose to ignore sentences without consequence and I can navigate to any page I like, so what advantage does skipping provide?
It seems pretty easy to add invalid sentences. I entered a period for the sentence and the source and it told me the sentence already exists (implying it would have allowed it if it didn’t). I then changed it to a comma in the sentence box and it went through and ended up as a blank sentence.
I don’t think users should be allowed to review their own sentences.
Should it reject if the first letter of a sentence doesn’t begin with a capital letter? That might help to catch copy and paste errors where there was an errant newline in the middle of the sentence.
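The punctuation-only submission and the capital-letter question above could both be caught by small checks like the following. This is a minimal sketch with hypothetical function names, not code from the tool, and whether the capital-letter rule should actually reject a sentence is an open question.

```python
def has_spoken_content(sentence):
    """True if the sentence contains at least one letter or digit."""
    return any(ch.isalnum() for ch in sentence)


def starts_with_capital(sentence):
    """True if the first letter in the sentence is uppercase.

    Leading quotes or other punctuation are skipped; a sentence with no
    letters at all returns False.
    """
    for ch in sentence:
        if ch.isalpha():
            return ch.isupper()
    return False


if __name__ == "__main__":
    for text in [".", ",", "Hello there.", "and then it stopped."]:
        print(repr(text), has_spoken_content(text), starts_with_capital(text))
```

A sentence consisting of only “.” or “,” fails the first check, and a fragment produced by an errant newline, such as “and then it stopped.”, fails the second.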
We decided to stick with the current login system to avoid adding additional work for this version. In the next deployment we have added some explanation of how it works on the same page.
Good idea. For now we are going to stick with this workflow (click yes on everything you want to approve and then finally submit), but we can see what other improvements we can make in the future. I’ve opened this issue to track it.
The skip button is broken right now; we plan to remove it in the next deployment.
Can you describe this in more detail in a new issue? I don’t know if I fully understand what you got.
This is intentional since we know a lot of people will be adding a lot of sentences from different sources other than their own creation.
Not sure, this might limit the ability for people to take sentences from long paragraphs and split them into smaller ones that can still be read and make sense.
I’ve deployed the latest version with many bugfixes; however, the skip button will still be there. I will need to think about that removal a bit more. It should, however, no longer mark the sentence as approved/rejected and should simply do nothing. This is tracked in https://github.com/Common-Voice/sentence-collector/issues/44
Let’s do a final push in the next couple of days to make sure we are ready to move the tool to the beta phase. Let’s keep testing and reporting issues.
In the beta phase we will clean up the database and invite everyone in the Common Voice community who has been asking to get their sentences included to start using it as the main channel for sentence submission and review.
Sentences added and reviewed in the beta phase will start being incorporated into the main Common Voice site. We expect some languages to reach the 5000 sentences, which will allow them to enable voice collection.