We are coming from here
Feel free to open a topic here and we can figure out how to expose it from the tool.
Yes, that’s planned in our roadmap.
There is a cleanup planned for all the sentences in the repo; @gregor will be running it.
No acronyms are allowed, because we don’t really know how people are going to read them. They can be clear for you but we can’t ensure how other people will read them. I see you opened a topic about this.
Well, if a sentence got a negative vote but then two other people voted positive, it will get validated. There is little we can do if people are voting positive here. We can, however, identify who voted on a specific sentence (by checking the sentences json) and try to give these people feedback about it.
We can also think about how to implement a system that allows us to catch and fix the small percentage of wrong sentences that end up validated. Maybe from the voice web site? Somewhere else?
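To make the voting behaviour described above concrete, here is a rough sketch in Python. It is not the tool's actual code: the vote record layout (`voter`, `approved`) and the threshold of two votes are assumptions for illustration, and the real sentences json may be structured differently.

```python
def sentence_status(votes, needed=2):
    """Return (status, approvers) for one sentence.

    `votes` is a hypothetical list of dicts such as
    {"voter": "alice", "approved": True}; this layout is assumed,
    not taken from the real sentences json.
    """
    approvers = [v["voter"] for v in votes if v["approved"]]
    rejecters = [v["voter"] for v in votes if not v["approved"]]

    if len(approvers) >= needed:
        status = "validated"   # two positive votes win, even after a negative one
    elif len(rejecters) >= needed:
        status = "rejected"
    else:
        status = "pending"
    return status, approvers


votes = [
    {"voter": "alice", "approved": False},
    {"voter": "bob", "approved": True},
    {"voter": "carol", "approved": True},
]
status, approvers = sentence_status(votes)
```

With the example votes, the sentence ends up validated despite the first negative vote, and `approvers` lists the people who could be given feedback.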
Yes, I think both the recording and validation pages on the CV site should have a way to flag up a sentence. Maybe once it gets x flags it ends up being automatically added to the review queue again?
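Something like the following is what I have in mind, as a minimal sketch only. The class name, the threshold of 3, and the idea of counting one flag per user are all my own assumptions, not anything that exists on the CV site today.

```python
from collections import defaultdict


class FlagTracker:
    """Hypothetical tracker: re-queue a sentence once enough users flag it."""

    def __init__(self, threshold=3):
        self.threshold = threshold          # the "x flags" value, assumed to be 3
        self.flags = defaultdict(set)       # sentence_id -> set of user ids
        self.review_queue = []              # sentences waiting for re-review

    def flag(self, sentence_id, user_id):
        """Record a flag; return True if this flag pushed the sentence into review."""
        self.flags[sentence_id].add(user_id)  # a set ignores repeat flags by one user
        reached = len(self.flags[sentence_id]) >= self.threshold
        if reached and sentence_id not in self.review_queue:
            self.review_queue.append(sentence_id)
            return True
        return False


tracker = FlagTracker(threshold=3)
results = [
    tracker.flag("s1", "alice"),
    tracker.flag("s1", "alice"),  # same user again, does not count twice
    tracker.flag("s1", "bob"),
    tracker.flag("s1", "carol"),  # third distinct user triggers re-review
]
```

Counting distinct users rather than raw flags would stop one person from forcing a sentence back into the queue on their own.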
I also think it would be useful to be able to comment on sentences, ideally in a way that can be seen by everyone, not just the original author. Because sometimes there are edge cases outside of the rules and it would be helpful to engage with others and get their opinions. For instance, I had a sentence today that was something like “One way to annoy Gill is to spell her name Jill.” The G should be pronounced like a J but I don’t know how obvious that is to people who aren’t British. The sentence doesn’t break the rules but it might end up that the majority of readers get it wrong.
There’s also another factor that the rules don’t cover but is bound to come up sooner or later: what is classed as offensive? What is offensive to some may not be to others, so that’s why discussion between validators is needed IMO.
I think you have mis-scored many of those clips. For example:
According to the poet, April is the cruellest month
- TS Eliot being called out for an alleged spelling error. Whatever next?
Cathy took two hundred and ten pounds out of her bank account,
- Common British idiom, and the comma at the end doesn’t matter.
His central heating had stopped working,
- Common British idiom, and the comma at the end doesn’t matter.
The correct pronunciation is.
- Seems a valid clip. Not a full sentence admittedly, but quite readable.
The drawer drew a lifelike image of the man.
- “Drawer: a person or thing that draws” (Collins Dict). Nothing wrong with this at all.
By my count, you’ve mis-validated half the examples you gave. I also find it rather worrying that you give “One way to annoy Gill is to spell her name Jill” as being “an edge case outside the rules.” No it’s not: it’s a perfectly straightforward English sentence that 100% of British English speakers would read correctly without hesitation. It brings in two very common spellings of the same name, which of course we hope will be read properly, not guessed at by readers who may never have seen one of the spellings. If I see a US word I don’t know how to pronounce, I skip it: isn’t that what the skip button is for? May I suggest you bear in mind when reviewing sentences, too, that not all versions of English use the same orthography and grammar as you do? We are building a corpus of many different varieties of English (of which US and UK are only two examples).
Michael
Having said that, I quite agree that clearer validation rules are needed, and ideally some way of discussing and giving feedback on individual clips. It seems there’s quite a bit of disagreement between reviewers, as is evident from the consistently lowish ‘accuracy’ scores across the validation leaderboard.
Trailing commas are a mistake. Why accept sloppiness? If we’re going to intentionally allow mistakes through, why bother validating at all?
Also, I think you’ll agree that cutting a sentence short makes it harder to
Just wanted to add, regarding this:
I have validated almost 70,000 recordings and I can assure you that’s not what happens. Case in point: the real sentence in the corpus “The Qt framework is pronounced ‘cute’.” Almost everyone gets this wrong. I get the impression a lot of users don’t read the sentence first before hitting the record button. And I don’t think many users bother to listen to their clips or re-record.
Also, I never said that sentence was bad. I said I wasn’t sure if it was too ambiguous and wanted to check with others.
I agree that trailing commas are not helpful, and I’ve suggested on Github that they should be rejected by the upload filter: https://github.com/mozilla/voice-web/issues/1815
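For illustration, the check itself could be as simple as the sketch below. This is not how the existing upload filter is written; the function names are made up, and it only covers the trailing comma case discussed in this thread.

```python
def has_trailing_comma(sentence):
    """True if the sentence ends in a comma, ignoring trailing whitespace."""
    return sentence.rstrip().endswith(",")


def split_by_trailing_comma(sentences):
    """Split sentences into (accepted, rejected) using the trailing comma check."""
    accepted, rejected = [], []
    for sentence in sentences:
        (rejected if has_trailing_comma(sentence) else accepted).append(sentence)
    return accepted, rejected


accepted, rejected = split_by_trailing_comma([
    "Cathy took two hundred and ten pounds out of her bank account,",
    "His central heating had stopped working, ",
    "The drawer drew a lifelike image of the man.",
])
```

Here the first two sentences from earlier in the thread would be rejected and the third accepted.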
Great, thank you for submitting that. Although Sentence Collector actually has its own separate repo: https://github.com/Common-Voice/sentence-collector/issues
Ah, thanks. Didn’t know that. Will repost there tomorrow if need be.
@gregor Feel free to contact me (also on Github) when you start your cleanup. I have identified at least a dozen mistakes in the German sentences added through the collector tool.