I would like to make a distinction here between words that are not spoken in the corpus, and phonemes that are under-represented in the corpus.
Different ASR systems use different approaches - Common Voice is intended for ASR, even though it is bastardised into other speech technologies - such as character recognition followed by word prediction (so something like a CTC beam search decoder with a language model applied, as DeepSpeech does).
What I have not seen in corpus analysis of Common Voice is:
- A character analysis of which characters and character combinations representing phonemes - `gh`, `th`, `ck`, `ch`, etc. - appear in the dataset.
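As a minimal sketch of that character-level analysis (the digraph list and sample sentences here are illustrative, not drawn from Common Voice), counting digraph frequencies over a sentence list could look like:

```python
from collections import Counter

# Digraphs that often map to a single phoneme in English (illustrative subset).
DIGRAPHS = ["gh", "th", "ck", "ch"]

def digraph_counts(sentences):
    """Count occurrences of each digraph across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for dg in DIGRAPHS:
            counts[dg] += text.count(dg)
    return counts

# Hypothetical sample standing in for corpus sentences.
sample = ["The check arrived through the night.", "Thick fog, rough chalk."]
print(digraph_counts(sample))
```

A real analysis would run this over the full validated sentence set and normalise by total character count before comparing against a reference corpus.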
- A phonetic analysis of phoneme frequency in the dataset.
For example, I suspect, but have not proven, that Common Voice over-represents the `th` sound - the /θ/ phoneme - and under-represents phonemes like /dʒ/ - the `j` sound in `jog` or `jumper`.
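To test a suspicion like that, one could sketch a rough phoneme count. The grapheme-to-phoneme mapping below is a deliberate toy (it ignores voiced /ð/ as in "this", and spellings like `dge` for /dʒ/); a real analysis would use a proper G2P tool or pronunciation lexicon:

```python
from collections import Counter

# Toy grapheme -> phoneme hints, for illustration only. A real analysis
# would use a G2P tool (e.g. espeak-ng) or a pronunciation dictionary.
GRAPHEME_PHONEMES = {
    "th": "/θ/",   # simplification: also matches voiced /ð/ as in "the"
    "j": "/dʒ/",   # simplification: ignores "dge", "g" before "e"/"i"
    "ch": "/tʃ/",
}

def rough_phoneme_counts(sentences):
    """Very rough phoneme frequency estimate from spelling alone."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for grapheme, phoneme in GRAPHEME_PHONEMES.items():
            counts[phoneme] += text.count(grapheme)
    return counts

sample = ["The thin judge chose a jumper."]
print(rough_phoneme_counts(sample))
```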
If we are considering creating sentences that include new words - which will have an impact at the language model or word error rate level - then we should also analyse the character and phonetic distribution of the corpus, to ensure that we have good coverage at that level as well.
I am thinking here of the Harvard Sentences.
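Once character or phoneme counts exist, "good coverage" could be checked by comparing the corpus distribution against a target distribution. The target shares and threshold below are placeholders; real targets would come from a reference corpus or a phonotactic study of the language:

```python
def under_represented(counts, target, threshold=0.5):
    """Flag units whose corpus share falls below threshold * target share.

    counts: observed unit -> count (e.g. phoneme counts from the corpus).
    target: unit -> expected relative frequency (shares summing to 1).
    Both the target shares and the 0.5 threshold are assumptions.
    """
    total = sum(counts.values()) or 1
    flagged = []
    for unit, expected in target.items():
        observed = counts.get(unit, 0) / total
        if observed < threshold * expected:
            flagged.append(unit)
    return flagged

# Hypothetical numbers: /dʒ/ is scarce relative to its target share.
counts = {"/θ/": 8, "/dʒ/": 1, "/tʃ/": 3}
target = {"/θ/": 0.3, "/dʒ/": 0.3, "/tʃ/": 0.3}
print(under_represented(counts, target))
```

The flagged units would then feed directly into sentence creation, much as the Harvard Sentences were designed to be phonetically balanced.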