About the new English Sentences

I really like that we have new sentences now. I’m seeing a lot of new proper names, which I, as a non-native speaker, struggle a little to pronounce, but that’s OK because it’ll improve my English. I do have doubts about abbreviations such as “Inc”: should I say just “Inc” or something else? I said it as “inc”, just in case, pun intended.

Do you have an example where “Inc” is included?

It was about the name of a company; I don’t recall much of it now.

I’ve seen it a couple of times. It’s from the wiki sentences.

OK, I understand that the code didn’t recognize it as an acronym (no uppercase letters) or an abbreviation (missing dot).
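That kind of detection could look something like this minimal sketch (hypothetical; the actual importer’s logic may differ):

```python
def looks_like_acronym(token):
    # e.g. "NASA" -- all uppercase letters, more than one character
    return token.isupper() and token.isalpha() and len(token) > 1

def looks_like_abbreviation(token):
    # e.g. "Inc." -- ends with a period
    return token.endswith(".") and len(token) > 2

print(looks_like_acronym("Inc"))       # False: mixed case, so not an acronym
print(looks_like_abbreviation("Inc"))  # False: missing dot, so it slips through
```

Under these heuristics, “Inc” without a trailing dot fails both checks, which would explain why it wasn’t filtered.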

Some sentences are ridiculous. Here are some.

The characters for Kyoto are 京都 and Osaka’s are 大阪.

The township is in Schuylkill Valley School District.

Abolhasan Saba, Esmaeil Ghahremani and Ali-Naqi Vaziri were among his students.

Bundesliga club Kaiserslautern on a one-year contract.

It is also close to the Naskapi reserved land of Kawawachikamach.

Niche words are important, but “niche” is a niche word, not “Kawawachikamach”. Surely there’s a better way to get dictionary coverage that doesn’t involve foreign place names and twenty-letter scientific Latin terms.

I posted an issue about this here so feel free to chime in with any comments: https://github.com/mozilla/voice-web/issues/1958

1.5 million sentences were imported so even if only 3% are bad, that’s quite a lot. I’m working on a script to filter out most of them but some manual validation will still be needed.


We agree, thanks for flagging, we are looking into improving and fixing this.

Hi, it’s me again. I’ve come across some Japanese words in this sentence: “The characters for Kyoto are 京都 and Osaka’s are 大阪.” When I don’t know how to pronounce a word I look it up with Google Translate, but I think this could lead to some mismatch. Was the inclusion of non-English words planned? Is there a plan to collect voice data for other languages?

Merging all messages about this in this topic.

Yes, @gweber is looking into it, for now please skip these sentences.

Thanks!

I’ll do that. Roger that!

I’m recording some clips, and sometimes there are sentences that are confusing. I think an option to skip these sentences during recording would be welcome.
Example: “The original center ran perpendicular to W. Club Blvd.”
I can’t record new clips without having to record this sentence, which is why I think this option is needed. I just recorded a silent clip.

I want to let you know that, thanks to @dabinat’s great work, we have filtered out a lot of sentences with issues (in the end only around 8% were problematic).

These changes should be reflected in the next deployment.

Thanks everyone for your valuable feedback to improve the project :slight_smile:

Cheers.

We also thank the Common Voice team for this awesome project!

@Codigo_Logo_Programacao_e_Inteligencia_Artificial
Just in case it wasn’t mentioned before, there is a “skip” button which will give you a new phrase to record (so you still have 5). Isn’t it available to you?

As for “Inc” it would mean “incorporated” when about a company.
https://en.wikipedia.org/wiki/Inc.
The intention may have been to remove these strings, but when it’s a standard part of a company name like “Acme Inc” I’d say it could be read as “ink”. The safe bet is to skip, though.

Yeah I’m seeing this button now. Ok got it.

While it’s fantastic to have so many English sentences from Wikipedia, we shouldn’t assume that everything should come from there. WP sentences are typically straight facts, which are often boring to record and review. They frequently include really obscure non-English proper names (such as villages in Russia) that aren’t at all useful for the dataset and are exceptionally hard for volunteers to read. And they mostly lack the proper names that we do need, such as common English given names.

If nobody objects, I’ll restart uploading sentences from interesting public domain books, with personal names replaced by a script to increase name diversity.

Do you have ideas on how we can optimize the Wikipedia extraction to avoid this issue?

I have been trying to filter out letter sequences that don’t tend to occur often in English. For example, there are no English words (that I know of) that contain the letter sequences “uuk” or “ijp”, so it filters sentences with words containing these letter sequences.
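The core idea can be sketched like this (a minimal illustration with a hypothetical blocklist; the real script linked below has its own curated checks):

```python
import re

# Hypothetical examples of letter sequences that rarely occur in English words.
UNLIKELY_SEQUENCES = ["uuk", "ijp"]

def contains_unlikely_sequence(word):
    word = word.lower()
    return any(seq in word for seq in UNLIKELY_SEQUENCES)

def is_valid_sentence(sentence):
    # Reject a sentence if any of its words contains an unlikely letter sequence.
    words = re.findall(r"[A-Za-z]+", sentence)
    return not any(contains_unlikely_sequence(w) for w in words)

print(is_valid_sentence("The township is in Schuylkill Valley."))  # True
print(is_valid_sentence("He visited Nuuk last summer."))           # False ("uuk")
```

A blocklist like this catches many foreign place names cheaply, at the cost of some false positives, which is why manual validation is still needed on top of it.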

You can find my script here: https://github.com/dabinat/cvtools/blob/master/sentence_validator.py

(I also have a PR awaiting approval with these changes: https://github.com/mozilla/voice-web/pull/2040 )


Yes, the Wikipedia stuff has improved word coverage a lot but it’s all past-tense, third-person fact description. We still need other sources for diversity.

I’m happy to review any sentences you upload.