About the new English Sentences

I really like that we have new sentences now. I’m seeing a lot of new proper names, which I, as a non-native speaker, struggle a little to pronounce, but that’s OK because it’ll improve my English. I do have doubts about abbreviations such as “Inc”: should I say just “Inc” or something else? I said it as “inc”, just in case, pun intended.

Do you have an example where “Inc” is included?

It was about the name of a company; I don’t recall much of it now.

I’ve seen it a couple of times. It’s from the wiki sentences.

OK, I understand that the code didn’t recognize it as an acronym (no uppercase letters) or an abbreviation (missing dot).
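That kind of detection could look something like this minimal sketch (hypothetical; the actual importer’s logic may differ):

```python
def looks_like_acronym(token):
    # e.g. "NASA" -- all uppercase letters, more than one character
    return token.isupper() and token.isalpha() and len(token) > 1

def looks_like_abbreviation(token):
    # e.g. "Inc." -- ends with a period
    return token.endswith(".") and len(token) > 2

print(looks_like_acronym("Inc"))       # False: mixed case, so not an acronym
print(looks_like_abbreviation("Inc"))  # False: missing dot, so it slips through
```

Under these heuristics, “Inc” without a trailing dot fails both checks, which would explain why it wasn’t filtered.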

Some sentences are ridiculous. Here are some.

The characters for Kyoto are 京都 and Osaka’s are 大阪.

The township is in Schuylkill Valley School District.

Abolhasan Saba, Esmaeil Ghahremani and Ali-Naqi Vaziri were among his students.

Bundesliga club Kaiserslautern on a one-year contract.

It is also close to the Naskapi reserved land of Kawawachikamach.

Niche words are important, but “niche” is a niche word, not “Kawawachikamach”. Surely there’s a better way to get dictionary coverage that doesn’t involve foreign place names and twenty-letter scientific Latin terms.

I posted an issue about this here so feel free to chime in with any comments: https://github.com/mozilla/voice-web/issues/1958

1.5 million sentences were imported so even if only 3% are bad, that’s quite a lot. I’m working on a script to filter out most of them but some manual validation will still be needed.


We agree, thanks for flagging, we are looking into improving and fixing this.

Hi, it’s me again. I’ve come across some Japanese words in this sentence: “The characters for Kyoto are 京都 and Osaka’s are 大阪.” When I don’t know how to pronounce a word I look it up with Google Translate, but I think this could lead to some mismatch. Was the inclusion of non-English words planned? Is there a plan to collect voice data for other languages?

Merging all messages about this in this topic.

Yes, @gweber is looking into it, for now please skip these sentences.

Thanks!

I’ll do that. Roger that!

I’m recording some clips, and sometimes there are sentences that are confusing. I think an option to skip these sentences during recording would be welcome.
Example: “The original center ran perpendicular to W. Club Blvd.”
I can’t record new clips without having to record this sentence, which is why I think this option is needed. I just recorded a silent clip.

I want to let you know that, thanks to @dabinat’s great work, we have filtered out a lot of sentences with issues (in the end only around 8% were problematic).

These changes should be reflected in the next deployment.

Thanks everyone for your valuable feedback to improve the project :slight_smile:

Cheers.

We also thank the Common Voice team for this awesome project!

@Codigo_Logo_Programacao_e_Inteligencia_Artificial
Just in case it wasn’t mentioned before, there is a “skip” button which will give you a new phrase to record (so you still have 5). Isn’t it available to you?

As for “Inc” it would mean “incorporated” when about a company.
https://en.wikipedia.org/wiki/Inc.
The intention may have been to remove these strings, but when it’s a standard part of a company name like “Acme Inc” I’d say it could be read as “ink”. The safe bet is to skip, though.

Yeah I’m seeing this button now. Ok got it.

While it’s fantastic to have so many English sentences from Wikipedia, we shouldn’t assume that everything should come from there. WP sentences are typically straight facts, which are often boring to record and review. They frequently include really obscure non-English proper names (such as villages in Russia) that aren’t at all useful for the dataset and are exceptionally hard for volunteers to read. And they mostly lack the proper names that we do need, such as common English given names.

If nobody objects, I’ll restart uploading sentences from interesting public domain books, with personal names replaced by a script to increase name diversity.

Do you have ideas on how we can optimize the Wikipedia extraction to avoid this issue?

I have been trying to filter out letter sequences that don’t tend to occur often in English. For example, there are no English words (that I know of) that contain the letter sequences “uuk” or “ijp”, so it filters sentences with words containing these letter sequences.
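The core idea can be sketched like this (a minimal illustration with a hypothetical blocklist; the real script linked below has its own curated checks):

```python
import re

# Hypothetical examples of letter sequences that rarely occur in English words.
UNLIKELY_SEQUENCES = ["uuk", "ijp"]

def contains_unlikely_sequence(word):
    word = word.lower()
    return any(seq in word for seq in UNLIKELY_SEQUENCES)

def is_valid_sentence(sentence):
    # Reject a sentence if any of its words contains an unlikely letter sequence.
    words = re.findall(r"[A-Za-z]+", sentence)
    return not any(contains_unlikely_sequence(w) for w in words)

print(is_valid_sentence("The township is in Schuylkill Valley."))  # True
print(is_valid_sentence("He visited Nuuk last summer."))           # False ("uuk")
```

A blocklist like this catches many foreign place names cheaply, at the cost of some false positives, which is why manual validation is still needed on top of it.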

You can find my script here: https://github.com/dabinat/cvtools/blob/master/sentence_validator.py

(I also have a PR awaiting approval with these changes: https://github.com/mozilla/voice-web/pull/2040 )


Yes, the Wikipedia stuff has improved word coverage a lot but it’s all past-tense, third-person fact description. We still need other sources for diversity.

I’m happy to review any sentences you upload.