🔢 Number of sentences we need to create a good Chinese STT model

When we deploy this effort, I really hope we don't use it for Chinese at first, because quality is more important than quantity when it comes to sentences.

We already cover 55% of Chinese pronunciations with only 3,600 sentences (and 75% if we don't consider tones). I really don't think we need 2M not-so-fluent sentences instead of 20K good-quality, easy-to-read (and sometimes even funny) sentences.

The people who are using the CV dataset now are pretty satisfied with the current coverage. In practice, we just need more minutes of recordings, not more sentences.

Bad samples will hurt our promotion efforts and dampen people's interest in recording more minutes - and that is our final goal.

I asked about this, and we have been told a few times that one sentence per clip is what we need; for 2,000 hours that's 1.8M sentences. Maybe @kdavis or @josh_meyer can chime in here and provide more context.

If we have 800 people recording the same 1,000 sentences, each averaging 5 seconds in length, then we will have 4M seconds ≈ 1,100 hours of recordings.
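As a quick sanity check on that arithmetic (the 5-second average clip length is the assumption from above):

```python
# Rough arithmetic for the repeated-sentences scenario described above.
speakers = 800
sentences = 1000
avg_clip_seconds = 5  # assumed average clip length

total_seconds = speakers * sentences * avg_clip_seconds
total_hours = total_seconds / 3600

print(f"{total_seconds:,} seconds")   # 4,000,000 seconds
print(f"{round(total_hours)} hours")  # 1111 hours
```

So even a small sentence set, read by many speakers, does produce the raw hours of audio.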

So I don't really see why we need to focus on getting more sentences. I believe the more important metric is the pronunciation coverage rate.

We only have about 1,500 different pronunciations for all 20K Chinese characters (only 5k of which are in common daily use), and we already have 900 pronunciations in our sentence DB.

I plan to increase that percentage to 70% this month by doubling the number of sentences to 7,500.

@nukeador & @irvin

We don't want to repeat sentences if possible. In an ideal world, each text sentence would be read only once, and by only one speaker. As @irvin says, different pronunciations are important. However, a model trained on only 3,600 sentences will be biased toward the text in those sentences. The more sentences we use, the less the model can be biased toward those sentences. So, to be blunt, I strongly recommend that we do not repeat the good sentences we have until we get 2,000 hours of recordings.

However, I agree with @irvin that reading awkward sentences from Wikipedia is bad. This sounds like a recipe for disaster, especially with the articles being mixed and then some machine translation going on in the background. People will stumble over sentences while reading, which will both frustrate speakers and lead to bad recordings.

If we can find a way to isolate “probably good” sentences from Wikipedia and use those, this might be a good idea.

With Mandarin having so many different characters, we really need as much diversity as possible in the text. With English, if we have 1,000 sentences or 1,000,000 sentences, both corpora will still contain all 26 characters. For Mandarin, on the other hand, there will be a big difference between the unique character set of 1,000 sentences compared to 1,000,000 sentences. This is also very important for people using this dataset for end-to-end speech recognition (like DeepSpeech).

If there are characters missing from the transcripts, our models will not be able to produce them. This may be fixed with some NLP after the fact, but we really want to have as many characters as possible.

(Update: rewritten for clarity on 26 Feb, 3:40 UTC)


I think I see the problem. The number of voice "units" we need for Mandarin and for English may not differ as much as you think. Let me try to explain the big difference between my picture and yours.


Tl;dr: the final Deep Speech model we need is actually a voice-to-phonetics model, rather than a voice-to-Chinese-characters model.

When the machine hears a voice, its output should not be Chinese characters but romanized phonetics (either Zhuyin or Pinyin).

We feed the model best_browser_in_the_world.mp3,
and the output should be,

  • “g4 ru,4 g;4 yjo4 cl3 2k7 xu.6 x03 fu4” (Zhuyin, typed as keyboard characters)
  • “Shì jiè shàng zuì hǎo de liú lǎn qì” (Pinyin)

but not,

  • 世界上最好的瀏覽器 (in Chinese characters)

The phonetic-based input method (which all of us use) is a classic and popular open-source topic that has been researched for 20 years.

Once people get the above output, we can easily convert it back to “世界上最好的瀏覽器” with a Zhuyin or Pinyin dictionary table*, which contains all the phonetic combinations of phrases/characters and their frequencies.

* Zhuyin table for example
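A minimal sketch of that dictionary-table lookup, using a tiny hand-made Pinyin table that covers only the example sentence (a real table would list every syllable sequence with its candidate phrases and their frequencies, and would have to handle ambiguity):

```python
# Toy pronunciation→character table for the example sentence only.
# A real Zhuyin/Pinyin table would map every phonetic combination to its
# candidate phrases/characters together with their usage frequencies.
PINYIN_TABLE = {
    "shi4 jie4": "世界",
    "shang4": "上",
    "zui4": "最",
    "hao3": "好",
    "de5": "的",
    "liu2 lan3": "瀏覽",
    "qi4": "器",
}

def phonetics_to_text(segments):
    """Convert model output (groups of syllables) back to Chinese characters."""
    return "".join(PINYIN_TABLE[seg] for seg in segments)

model_output = ["shi4 jie4", "shang4", "zui4", "hao3", "de5", "liu2 lan3", "qi4"]
print(phonetics_to_text(model_output))  # 世界上最好的瀏覽器
```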


Why do we want the model to output pronunciations instead of Chinese characters?

Because we have too many characters.

Consider the 5k common characters (which is roughly how many most people learn and use). Ideally, if we had samples of every sentence people might speak, the model could learn to identify each whole sentence from its voice. But with 20-character sentences, that's 5k^20 combinations, so that's not going to work.

So, in practice, we break sentences into phrases (2–3 syllables) and single characters (one syllable). Many databases record voice at the phrase level rather than the sentence level, and the model just needs to find the phrase/character for every one, two, or three syllables. In Chinese, we only have 5k^3 + 5k^2 + 5k combinations - about 125,000M - which is much more realistic.

A voice-phonetics model

Actually, we don't need that many samples either, because those 5k characters share only 1,500 pronunciations. If we ask our model to identify not different characters but different phrase pronunciations, we can easily reduce the number of sample combinations to 1.5k^3 + 1.5k^2 + 1.5k ≈ 3,375M.

How can we further reduce the samples we need? We can break sentences all the way down to characters. Each character is exactly one syllable - that's an advantage Mandarin has: the model should be able to split a clip easily by looking at the pauses and the level of the input voice. So I believe the minimum we need is, in fact, only 1.5k samples.
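The combination counts above can be checked directly (the exact totals are slightly above the rounded figures in the text, since the cubic term dominates):

```python
# Combination counts from the argument above: phrases of 1–3 syllables,
# built either from the ~5k common characters or the ~1.5k pronunciations.
def combos(units):
    return units**3 + units**2 + units

char_combos = combos(5000)  # character-based phrase model
pron_combos = combos(1500)  # pronunciation-based phrase model

print(f"{char_combos:,}")  # 125,025,005,000  (≈ 125,000M)
print(f"{pron_combos:,}")  # 3,377,251,500   (≈ 3,375M)
```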


Example

Take this sentence as an example:

世界上最好的瀏覽器 (The best browser in the world)

If we want our machine to listen to the whole sentence and identify its characters, that is -

世 界 上 最 好 的 瀏 覽 器

We would need the machine to learn every combination of all the characters - that's when we'd need 5k^20 samples. Not going to work.

How can we reduce it?

Phrase based voice-phonetics model

We train the machine to identify the pronunciations of phrases, not Chinese characters. We already know Mandarin has around 1,500 different pronunciations, so if we consider only 2-character phrases, our model only needs to output this result,

  • Shìjiè shàng zuì hǎo de liúlǎn qì (Hanyu Pinyin; you can get this by copy-pasting the whole sentence into Google Translate)
  • ㄕˋㄐㄧㄝˋ / ㄕㄤˋ / ㄗㄨㄟˋㄏㄠˇ / ㄉㄜ˙ / ㄌㄧㄡˊㄌㄢˇ / ㄑㄧˋ (Zhuyin)
  • g4ur,4 / g;4 / yjo4cl3 / 2k7 / xu.6x03 / fu4 (Zhuyin rep. by keyboard char)

People who have this output can use a pronunciation-to-character dictionary table, another model, or some other method to figure out which words are used in the sentence, and get the final result,

世界 / 上 / 最 / 好 / 的 / 瀏覽 / 器

The problem now looks much more similar to English and other Roman-alphabet languages. We only need 1.5k × 1.5k samples - that's 2.25M samples.

Character based voice-phonetics model

We can reduce the data we need even further, because we actually only need our model to identify characters, not phrases. The output we really need is,

  • Shì jiè shàng zuì hǎo de liú lǎn qì (Hanyu Pinyin)
  • ㄕˋ / ㄐㄧㄝˋ / ㄕㄤˋ / ㄗㄨㄟˋ / ㄏㄠˇ / ㄉㄜ˙ / ㄌㄧㄡˊ / ㄌㄢˇ / ㄑㄧˋ (Zhuyin)
  • g4 / ru,4 / g;4 / yjo4 / cl3 / 2k7 / xu.6 / x03 / fu4 (Zhuyin keyboard char)

And in the same way, people can easily convert the output back to the final sentence,

世 界 上 最 好 的 瀏 覽 器


Although we have 5k characters, we only have 1,500 sounds, and those can be represented by 26 letters, each syllable with 4 tones* (if we base it on Pinyin), or by 37 characters with 4 tones* (if we base it on Zhuyin). That's much more similar to English, right?

* some researchers told me that we don’t even need tones in this scenario.

That is also why we can still type Chinese with a 36-key English keyboard, rather than with a keyboard with a few hundred keys - which really did exist in computing history.


How do we train our model?

In practice, how can we train the model with this pronunciation-based strategy?

We shouldn't feed the model the original Chinese sentences. We should convert each sentence into Pinyin/Zhuyin and feed it that instead.

You feed best_browser_in_the_world.mp3

with

  • “g4 ru,4 g;4 yjo4 cl3 2k7 xu.6 x03 fu4” (converted with the Zhuyin table)
  • “Shì jiè shàng zuì hǎo de liú lǎn qì” (converted with the Pinyin table)
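That label-conversion step can be sketched like this, again with a tiny hand-made character-to-Pinyin table covering only the example sentence (a real pipeline would use a full dictionary or a conversion library):

```python
# Toy character→Pinyin table for the example sentence only. A real
# pipeline would use a complete dictionary of character readings.
CHAR_TO_PINYIN = {
    "世": "shi4", "界": "jie4", "上": "shang4", "最": "zui4",
    "好": "hao3", "的": "de5", "瀏": "liu2", "覽": "lan3", "器": "qi4",
}

def sentence_to_labels(sentence):
    """Convert a Chinese transcript into the phonetic label sequence
    that gets paired with the audio clip for training."""
    return " ".join(CHAR_TO_PINYIN[ch] for ch in sentence)

print(sentence_to_labels("世界上最好的瀏覽器"))
# shi4 jie4 shang4 zui4 hao3 de5 liu2 lan3 qi4
```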

(I find that this discussion may be useful for others, so I have reposted it to my Medium.)

And yes, I should try to calculate how many distinct characters we already have in our sentences; ideally, all 5k characters should be included. I'll add this metric to my coverage-stats script.
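That metric could look something like this (the `sentences` list and the common-character set are stand-ins for the real sentence DB and the real 5k-character list):

```python
# Sketch of the distinct-character coverage metric: what fraction of the
# common-character list appears at least once in the sentence DB.
def char_coverage(sentences, common_chars):
    seen = set()
    for s in sentences:
        seen.update(ch for ch in s if ch in common_chars)
    return len(seen) / len(common_chars)

# Stand-in data; the real script would load the 5k common characters
# and the full sentence DB.
common_chars = set("世界上最好的瀏覽器例子")
sentences = ["世界上最好的瀏覽器"]
print(f"{char_coverage(sentences, common_chars):.0%}")  # 82%
```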

I'm moving the messages about the number of sentences to a new topic, so the Wikipedia one stays focused on that issue and we can expand this one better.
