I’m excited to see this up and running! Common Voice is shaping up to be a really powerful dataset, but I’ve worried for a while about whether there is enough sentence diversity in it. Thanks for your work on this tool.
Below are some things that made me second-guess myself when reviewing or writing. These are areas where there might be a need for better reviewer/submitter guidelines, or candidates for a FAQ section in the guidelines.
Autovalidation-specific issues
- Should acronyms really be ruled invalid?
The current guidelines and autovalidator don’t allow abbreviations and acronyms because they can be pronounced many ways. However, in my opinion the fact that they can be pronounced in several ways makes them valuable to a dataset, not harmful. An STT/ASR system must be able to handle all the ways users pronounce words and have a rich enough language model to capture acronyms correctly. Pronounced acronyms like “FIFA” or “LIDAR”, as well as “spelled out” acronyms like “CO2”, “USB”, “FC Barcelona”, “FAQ” are common parts of language and potential user queries. I think they should be included in the dataset.
- Longer max length
The current max length is set at 14 words. This is fairly short, and I found several of the sentences I wrote being rejected. I would say in general we shouldn’t disallow longer sentences, as it is important for practical STT/ASR systems to be able to handle longer sentences (the one you are finishing reading is 24 words). More data-driven analysis could be done on the typical length of spoken sentences, but I would vote 30 words is a sensible max.
- Number words
Certain words include digits. It seems these should not be spelled out, and digits should not be autorejected.
Ex: “San Francisco 49ers”, “H2O”, “5G wireless network”.
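On the max length point: here is a rough sketch of how the data-driven check could be done, assuming a list of transcribed spoken sentences is available. The function names and the whitespace-based definition of a "word" are my own assumptions for illustration, not how the autovalidator actually counts.

```python
import statistics


def word_count(sentence):
    """Count whitespace-separated tokens (a simple, assumed notion of 'word')."""
    return len(sentence.split())


def length_summary(sentences, proposed_max=30):
    """Summarize sentence lengths and how many exceed a proposed max.

    `sentences` is assumed to be a non-empty list of strings.
    """
    counts = sorted(word_count(s) for s in sentences)
    over = sum(1 for c in counts if c > proposed_max)
    return {
        "median": statistics.median(counts),
        "longest": counts[-1],
        "share_over_max": over / len(counts),
    }
```

Running this over a real corpus of spoken sentences with `proposed_max=14` and then `proposed_max=30` would show how much natural speech each limit excludes.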
Usability features
- Inform user why sentences were autorejected (length, digits, abbreviation, etc).
- Bug: Hitting “submit” on the add sentences form sometimes does nothing. For example, when I tried to copy and paste the 438 sentences I wrote here https://pastebin.com/C0YpxtSJ it would not accept them and failed without any user feedback. Certain subsets of the paste are able to progress to the next screen. I’m not able to investigate further right now, but it is some kind of bug. If I figure that issue out I can open an issue/PR on GitHub.
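To illustrate the first point, here is a minimal sketch of a validator that returns the reasons for a rejection instead of a bare pass/fail. The three checks are simplified guesses at the rules described in this post (length, digits, abbreviations); the function name and regular expressions are hypothetical and not taken from the real autovalidator.

```python
import re
from typing import List

MAX_WORDS = 14  # the current limit mentioned in this post


def rejection_reasons(sentence: str, max_words: int = MAX_WORDS) -> List[str]:
    """Return human-readable reasons a sentence would be rejected.

    An empty list means the sentence passes these (assumed) checks.
    """
    reasons = []
    n_words = len(sentence.split())
    if n_words > max_words:
        reasons.append(f"too long: {n_words} words (max {max_words})")
    if re.search(r"\d", sentence):
        reasons.append("contains digits")
    # Crude heuristic: a run of two or more capital letters as a whole token.
    if re.search(r"\b[A-Z]{2,}\b", sentence):
        reasons.append("contains an abbreviation or acronym")
    return reasons
```

The list could then be shown next to each rejected sentence, so the submitter knows whether to shorten it, spell out a number, or drop an acronym.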
Potential areas for improved guidelines
- Is incorrect capitalization incorrect?
Ex: “I’ve heard that this hainanese chicken rice tastes amazing!”
“hainanese” should be capitalized.
However, this should not affect pronunciation, and I would vote that the guidelines rule this valid.
- Should misplaced commas be invalid?
Ex: “If we’re in London, and trying to download a cat picture from a website, based in Australia, fetching the data results in a lot of hops along a lot of networks.”
The comma after “website” is unnecessary and implies an unnatural pronunciation.
Ex: “We also emit CO2 when we burn fossil fuels to keep places we work well-lit and depending where you are in the world warm or cool enough to work in.”
This really should have some commas to be “grammatically correct”. However, it is still pronounceable, and thus could be accepted if that is what the guidelines encourage.
- Clarify that dates are numbers which should be spelled out.
Ex: “In 2017 we saw an increase in growth.”
This might not be clear from the current guidelines.
- Hyphen in numbers or not? Or are both valid?
“nineteen eighty-eight” or “nineteen eighty eight”
- Em dashes
Ex: “When we look at this map, it gives us an idea of how clean the emissions are likely to be, on average for a country - for example we can see that in France, with its heavy reliance on nuclear energy, generating a kilowatt hour of electricity than Poland, which is heavily invested in coal.”
This sentence would be a good challenge for a system and is grammatically valid (other than the fact that it uses a hyphen rather than an em dash). However, a panel of human transcribers given only the audio probably would not agree on the transcription, as a period and a dash sound similar. So it seems like this should be invalid. Agreed?
- Should foreign characters common in a language be allowed?
For example: letters with diacritics sometimes appear in English words (especially foreign proper nouns) like “Nestlé”, “L’Oréal”, “exposé”, “José”, and “Sofía”. Should these be allowed? Currently they would get autorejected.
- Colloquial grammar / “as they said it” spelling
For example: “I ain’t gonna put up with this”, “why ya’ gotta be like that”, “That’s one mean lookin’ motorcycle man.”
Is this allowed?
This could be valuable for a system to recognize, as it might better reflect how some real users talk, and a system should not fail when the user says “gonna” instead of “going to”. It isn’t “grammatically correct”, but this kind of writing is sometimes used in fiction to better communicate a character’s personality.
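On the diacritics question above: if the guidelines did decide to allow them, one possible approach is to accept any Latin letter rather than only A to Z. This is only a sketch under my own assumptions. The helper names and the set of allowed punctuation are hypothetical, and it relies on Python's standard `unicodedata` module rather than anything from the actual tool.

```python
import unicodedata


def is_latin_letter(ch):
    """True for Latin letters, including ones with diacritics such as 'é'."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def disallowed_chars(sentence, extra_allowed=" .,?!'’\"“”-"):
    """List the characters that are neither Latin letters nor allowed punctuation."""
    # NFC merges a base letter plus combining accent into one character,
    # so "e" followed by U+0301 is treated the same as "é".
    normalized = unicodedata.normalize("NFC", sentence)
    return [ch for ch in normalized
            if not (is_latin_letter(ch) or ch in extra_allowed)]
```

With this check, “Nestlé” and “Sofía” would pass, while digits and non-Latin scripts would still be flagged.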
All of these might not need specific guidelines, and for some we could just let the reviewers vote. However, I’ll add them here in case others also had the same questions and we want to try and standardize things more.