📖 Mandarin sentences from Wikipedia

There is one special limitation/feature of zh-TW and zh-CN Wikipedia: they share the same article data, and when people visit an article, it is converted into zh-TW or zh-CN on the fly.

How will this affect our plan?

There are both Taiwanese and mainland Chinese Wikipedia editors, so Chinese Wikipedia articles are a combination of phrases and sentence patterns used in Taiwan and China. That's not a problem for people who are just looking for knowledge (they can ignore the odd feeling and focus on the information), but it's a problem for us.

Most of the sentences will feel odd to both mainland Chinese and Taiwanese users. The experience will be pretty bad if we ask people from Taiwan to read out sentences written by a mainland editor, and mainland users to read sentences written by a Taiwanese editor.

How can we solve it?

All of the sentences we fetch from Wikipedia should also go through manual review and editing (by volunteers from both sides, into the databases of both versions).

I don’t know if I really understand the problem.

  • Do you mean that articles written originally in zh-CN are converted to zh-TW and the other way around?
  • And that’s a problem because the results are not natural?
  • Is there a way to know which ones are which, so we can just pick the originals written in zh-CN?

@george is still doing the planning for the Wikipedia effort. My understanding is that we want to do a couple of iterations where we extract 1,000 or so sentences, do a quick skim with native speakers, and repeat to understand the error ratio. But in the end we won't be able to manually review 2M sentences or more; we will have to work with an approximation.

Chinese Wikipedia articles are written in a MIX of zh-CN and zh-TW by editors from both places. That is, you will find zh-CN sentences and zh-TW sentences in the same article, and in some cases zh-CN words and zh-TW words in a single sentence.

Each article has a conversion table, and Wikipedia has a dozen general/topic-specific conversion tables; these are used together with a Simplified/Traditional character converter to display the correct language variant on the fly.
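As a rough illustration of how such table-based conversion could work (the table entries and function below are invented for this sketch, not Wikipedia's actual conversion data), phrase-level rules have to be applied before character-level ones:

```python
# Hypothetical sketch of MediaWiki-style variant conversion: a per-character
# base table plus phrase-level overrides. Table entries are invented samples,
# not Wikipedia's actual conversion data.

CHAR_TABLE = {"发": "發", "头": "頭"}   # simplified -> traditional, 1:1 fallback
PHRASE_TABLE = {"头发": "頭髮"}          # phrase overrides fix ambiguous characters

def convert(text):
    """Convert simplified text to traditional; phrase rules win over characters."""
    # Longest phrase match first, as real converters do.
    for phrase, repl in sorted(PHRASE_TABLE.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(phrase, repl)
    # Fall back to character-by-character conversion.
    return "".join(CHAR_TABLE.get(ch, ch) for ch in text)
```

The phrase rule matters because converting 头发 ("hair") character by character would yield 頭發 instead of the correct 頭髮.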

So the final result will have a mixed sense of Taiwanese and mainland Chinese "flavor", and won't read like the natural sentences we use in daily speech.

Chinese is one of the very few languages on Wikipedia with this mechanism, because the community wanted "more articles" in Chinese overall rather than a more fluent reading experience.


@rosana @george Flagging this as a potential risk

Thanks for this flag and yikes.

@irvin Can you or others think of ways to automatically identify non-mixed sentences, if they exist? Are there tags and such that help identify what’s what within an article to avoid problematic sentences?

(Note that we need to use only 1-3 sentences per article.)

Create or find a zh-CN language model and reject sentences/phrases below some threshold of probability.
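A minimal sketch of what that rejection could look like, assuming a character-bigram model trained on a (hypothetical) clean zh-CN corpus; the tiny corpus and threshold here are illustrative only:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Train a character-bigram LM with add-one smoothing (toy sketch)."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        padded = "^" + sent + "$"
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocab = len({ch for sent in corpus for ch in sent}) + 2  # + start/end marks

    def logprob_per_char(sent):
        """Average log-probability per character; low values = un-zh-CN-like."""
        padded = "^" + sent + "$"
        total = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                    for a, b in zip(padded, padded[1:]))
        return total / max(len(sent), 1)

    return logprob_per_char

# Tiny illustrative corpus; a real model needs a large clean zh-CN text.
score = train_bigram_lm(["今天天气很好", "今天很好"])

THRESHOLD = -2.0  # in practice, tuned on held-out manually reviewed sentences

def accept(sentence):
    return score(sentence) >= THRESHOLD
```

In practice the model would be a proper n-gram or neural LM, but the filtering step is the same: score each candidate sentence and keep only those above the threshold.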

I don’t know of a way to automatically identify them.

I’m still thinking that if we don’t need that many sentences, a manual filter at the final stage is still the best way to ensure good quality.

@irvin We will need that many sentences for the zh-CN corpus we are building.

The technique I mentioned above would automatically identify such sentences.

The only remaining problem is creating or finding a zh-CN language model.

Creating a zh-CN language model would require that we had a large zh-CN text. This seems like the problem we are trying to solve, but it may be possible to find a large zh-CN text that’s not CC-0 but would allow us to create and use a language model created from it. So this does not degenerate into our original problem.

Finding a zh-CN language model may be easier, though I haven’t spent any significant amount of time searching for such a zh-CN language model online.

Just so I understand the scope of the problem: is this akin to UK English vs. American English? In other words, do the conversions give text that, broadly, would be considered fine in both variants, but occasionally (color vs. colour) produce words and phrases which are awkward?

Is the problem more complicated than a distinction between “Traditional” vs. “Simplified” sentences?

You’ve said the conversion used on Wikipedia is between “zh-CN” and “zh-TW”, but I’m wondering if the conversion is just a “Traditional --> Simplified” look-up table.

If so, we can identify sentences with characters which are "only-Simplified" or "only-Traditional", right?
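A sketch of that look-up idea in Python; the two character sets below are tiny illustrative samples, not the real conversion tables:

```python
# Sketch of the look-up idea: flag sentences containing characters that exist
# in only one script. These two sets are tiny illustrative samples, not the
# real conversion tables.

SIMPLIFIED_ONLY = set("发头开门书马车")    # forms that only occur in simplified text
TRADITIONAL_ONLY = set("發頭開門書馬車")   # their traditional counterparts

def script_of(sentence):
    """Classify a sentence by which script-exclusive characters it contains."""
    has_simp = any(ch in SIMPLIFIED_ONLY for ch in sentence)
    has_trad = any(ch in TRADITIONAL_ONLY for ch in sentence)
    if has_simp and has_trad:
        return "mixed"
    if has_simp:
        return "simplified"
    if has_trad:
        return "traditional"
    return "neutral"  # only characters shared by both scripts
```

Note this only detects the script, not the regional vocabulary or grammar, so a "simplified" sentence could still use Taiwanese phrasing.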

What do you think, @irvin ?


Hi Kelly,

I have just rewritten the earlier post to explain why I think building a model that identifies phonetics rather than characters is a much easier approach and needs far, far less data (thousands vs. thousands of millions), based on our knowledge of Chinese computing and Chinese input methods.

Consider that even if we got that big a sentence dataset, we can’t have that many people recording millions of minutes, so a small corpus can really help.

Chinese Wikipedia is written and edited by Taiwanese, mainland Chinese, Hong Kong, and other volunteers, in both Traditional and Simplified Chinese, together.

So the source sentences in the articles are a random mix of zh-TW and zh-CN characters and phrases.

I’m not sure it’s easy to distinguish whether a sentence is originally zh-CN or zh-TW; we may be able to use common character tables from both Taiwan and China to estimate it.

Of course, we could convert all characters back to zh-TW or zh-CN at once, but sentence structures and phrases differ across the regions. Although people can understand and read them, they may hesitate when reading, or read with a "foreign" accent (which I feel is a natural reaction).

So I still feel it’s better to do a manual review and make sure the phrasing and grammar are well localized.

@irvin While the core of your post ("there are too many characters") is sound, it’s not relevant to what we are doing. We are going to output to a 256-dimensional softmax using UTF-8 character encoding.
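To illustrate the byte-level output idea (this sketch only shows the encoding side, not the actual model): each step of a 256-way softmax predicts one UTF-8 byte, so a CJK character is simply three consecutive byte classes and no per-character vocabulary is needed:

```python
# Illustration of the 256-way byte softmax: the model predicts one UTF-8 byte
# per step, so no per-character vocabulary is needed. A CJK character simply
# becomes three consecutive byte classes.

def to_byte_classes(text):
    """Map text to the sequence of output classes (0-255) a byte model emits."""
    return list(text.encode("utf-8"))
```

For example, the single character 中 becomes three output classes, while an ASCII letter is one class, yet both fit in the same 256-way output layer.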

Phonemes are nice but they are language dependent, a single phoneme is spoken slightly differently in different languages. Also, different languages use different numbers of phonemes, so the model architecture would have to change from language to language.

Also, using phonemes requires that we have a bi-directional phonetic dictionary for each language we tackle. For German or English this is not a big problem. However, for smaller languages this would exclude them from ever having a speech-to-text engine, as we, or someone, would have to find a linguist to write a phonetic dictionary for the language. Then, as the dictionary would be incomplete (think place names), we’d have to train a phoneme-to-grapheme model with the dictionary to deal with words not covered in it. The phoneme-to-grapheme model itself would be another model which has to be created, trained, maintained, deployed, and debugged.

All of these problems are avoided by our approach, which works with just audio + transcripts.


I gave a solution for this above.

Agreed. I’m only thinking about the problem of "how can we get a good enough Chinese STT model, and the amount of data we need, by this year", and not considering other languages.

If you are talking about the output of the model, then yes, it’s OK to output text in one script and convert it to the other. There is no spelling problem like color vs. colour in this scenario.

On the character level, each zh-TW character has exactly one corresponding zh-CN character, but one zh-CN character can map to many zh-TW characters. However, we do have dictionaries to fix this kind of problem.
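A sketch of the one-to-many direction of that mapping, with a few hand-picked sample entries (not a real dictionary):

```python
# Illustrative one-to-many mapping: one simplified character can correspond to
# several traditional characters, so naive per-character conversion is
# ambiguous without context. Sample entries only, not a real dictionary.

SIMP_TO_TRAD = {
    "发": ["發", "髮"],        # 發 "emit/develop" vs. 髮 "hair"
    "后": ["後", "后"],        # 後 "after" vs. 后 "empress"
    "干": ["乾", "幹", "干"],  # dry / to do / shield, among others
}

def candidates(simplified_char):
    """All traditional candidates for a simplified character (identity if unknown)."""
    return SIMP_TO_TRAD.get(simplified_char, [simplified_char])
```

Picking the right candidate requires context (usually a phrase-level dictionary), which is why the traditional-to-simplified direction is the easy one.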

There are two things going on:

  • Simplified Chinese Mandarin collected using Common Voice
  • A contracted, Mechanical Turk-style Simplified Chinese Mandarin data set

The second will be done on a short time scale as we need the data for MR’s speech recognition.

A minute or two of searching found that there is one here that is part of Ubuntu. So we know how to proceed.

Did you mean we are going to contract people and ask them to record on a Simplified Chinese version of Common Voice?

(I’m fine if we are; it should be cheaper than purchasing an available database and dealing with the legal problems.)

Yes, this is the project we talked about that @rosana was leading, plus having the community also contribute, to be able to cover the voice-diversity percentage we need.
