There is one special limitation/feature of zh-TW and zh-CN Wikipedia: they share the same article data, and when people visit an article, it is converted into zh-TW or zh-CN on the fly.
How will this affect our plan?
There are Wikipedia editors from both Taiwan and China, so Chinese Wikipedia articles are a combination of the phrases and sentence patterns used in Taiwan and China. It's not a problem for people who are looking for knowledge (they just need to ignore the odd feeling and focus on the information), but it's a problem for us.
Most sentences will sound strange to both Chinese and Taiwanese users. The experience will be pretty bad when we ask people from Taiwan to read out sentences written by Chinese editors, and Chinese users to read sentences written by Taiwanese editors.
How can we solve it?
All of the sentences we fetch from Wikipedia should also go through manual review and editing (by volunteers from both sides, into the databases of both versions).
I don't know if I really understand the problem.
Do you mean that articles written originally in zh-CN are converted to zh-TW and the other way around?
And that's a problem because the results are not natural?
Is there a way to know which is which, so we can just pick the sentences originally written in zh-CN?
@george is still doing the planning for the Wikipedia effort. My understanding is that we want to do a couple of iterations where we extract 1000 or so sentences, do a quick skim with native speakers, and repeat to understand the error ratio. But in the end we won't be able to manually review 2M sentences or more; we will have to work with an approximation.
Chinese Wikipedia articles are written in a MIX of zh-CN and zh-TW by editors from both places; that is, you will have zh-CN sentences and zh-TW sentences in the same article, and in some cases zh-CN words and zh-TW words in a single sentence.
Each article has a conversion table, and Wikipedia has a dozen general / topic-specific conversion tables; these are used together with a Simplified/Traditional character converter to display the correct language variant on the fly.
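To make the mechanism concrete, here is a minimal sketch of how such a conversion table might be applied. The table entries here are just two well-known zh-CN/zh-TW phrase pairs for illustration; the real MediaWiki rule sets are far larger and handle context-dependent cases.

```python
# Hypothetical mini conversion table (zh-CN phrase -> zh-TW phrase).
# Real MediaWiki tables contain thousands of entries plus per-article rules.
ZH_TW_TABLE = {
    "软件": "軟體",   # "software"
    "信息": "資訊",   # "information"
}

def convert_to_zh_tw(text: str) -> str:
    """Apply longest-match-first phrase replacement, table-lookup style."""
    for src, dst in sorted(ZH_TW_TABLE.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(src, dst)
    return text

print(convert_to_zh_tw("自由软件"))  # -> 自由軟體
```

The longest-match-first ordering matters because a phrase entry should win over any shorter entry it contains.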
So the final result has a mixed Taiwanese and Chinese "smell", and does not read like the natural sentences we use in daily speech.
Chinese is one of the few languages on Wikipedia with this mechanism, because the community wanted "more articles" in Chinese overall rather than a more fluent reading experience.
@irvin Can you or others think of ways to automatically identify non-mixed sentences, if they exist? Are there tags and such that help identify what's what within an article, to avoid problematic sentences?
(Note that we need to use only 1-3 sentences per article.)
I don't know of a way to automatically identify them.
I still think that if we don't need that many sentences, a manual filter in the final stage is the best way to ensure good quality.
@irvin We will need that many sentences for the zh-CN corpus we are building.
The technique I mentioned above would automatically identify such sentences.
The only remaining problem is creating or finding a zh-CN language model.
Creating a zh-CN language model would require a large zh-CN text corpus. That sounds like the very problem we are trying to solve, but it may be possible to find a large zh-CN text that is not CC-0 yet would still allow us to create and use a language model built from it. So this does not degenerate into our original problem.
Finding an existing zh-CN language model may be easier, though I haven't spent any significant amount of time searching for one online.
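As a sketch of the language-model idea: train a small character-level model on reference text for each variant, score every candidate sentence under both, and keep only sentences the zh-CN model clearly prefers. The toy corpora below are placeholders; real reference texts would be large zh-CN and zh-TW corpora.

```python
import math
from collections import Counter

def bigram_model(corpus: str):
    """Train a tiny add-one-smoothed character-bigram model; returns a scorer."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus)) or 1
    def logprob(sentence: str) -> float:
        score = 0.0
        for a, b in zip(sentence, sentence[1:]):
            score += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        return score
    return logprob

# Toy reference texts standing in for large zh-CN / zh-TW corpora.
score_cn = bigram_model("这个软件的信息很多" * 50)
score_tw = bigram_model("這個軟體的資訊很多" * 50)

def looks_like_zh_cn(sentence: str) -> bool:
    """Keep a sentence only if the zh-CN model scores it higher."""
    return score_cn(sentence) > score_tw(sentence)

print(looks_like_zh_cn("这个软件"))  # True under these toy corpora
```

In practice you would also want a confidence margin between the two scores, so that ambiguous (mixed) sentences are dropped rather than misclassified.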
Just so I understand the scope of the problem: is this akin to UK English vs. American English? In other words, the conversion gives text that would broadly be considered fine in both variants, but occasionally (color vs. colour) produces words and phrases which are awkward.
Is the problem more complicated than a distinction between "Traditional" vs. "Simplified" sentences?
You've said the conversion used on Wikipedia is between "zh-CN" and "zh-TW", but I'm wondering if the conversion is just a "Traditional --> Simplified" look-up table.
If so, we can identify sentences with characters which are "only-Simplified" or "only-Traditional"… right?
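That check is easy to sketch with character sets. The sets below are tiny illustrative subsets I picked by hand; a real implementation would use full Simplified/Traditional character lists (e.g. from a conversion dictionary with thousands of pairs).

```python
# Tiny illustrative subsets -- real lists contain thousands of characters.
SIMPLIFIED_ONLY = set("这个软经发国")   # forms used only in Simplified text
TRADITIONAL_ONLY = set("這個軟經發國")  # their Traditional counterparts

def classify(sentence: str) -> str:
    """Label a sentence by which script-exclusive characters it contains."""
    has_simp = any(c in SIMPLIFIED_ONLY for c in sentence)
    has_trad = any(c in TRADITIONAL_ONLY for c in sentence)
    if has_simp and has_trad:
        return "mixed"
    if has_simp:
        return "simplified"
    if has_trad:
        return "traditional"
    return "ambiguous"  # only characters shared by both scripts

print(classify("这个软件"))  # simplified
```

Note the "ambiguous" case: many characters are identical in both scripts, so a sentence can contain no script-exclusive character at all, and this test cannot detect zh-CN vs. zh-TW *phrasing*, only character forms.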
I have just rewritten the earlier post to explain why I think building a model that automatically identifies phonetics rather than characters is a much easier approach, and one that needs far, far less data: thousands of sentences vs. thousands of millions, based on our knowledge of Chinese computing and Chinese input methods.
Consider that even if we got that much sentence data, we couldn't have that many people recording millions of minutes, so a small corpus can really help.
Chinese Wikipedia is written and edited by Taiwanese, Chinese, Hong Kong, and other volunteers, in both Traditional and Simplified Chinese, together.
So the source sentences in an article are a random mix of zh-TW and zh-CN characters and phrases.
I'm not sure it's easy to tell whether a sentence was originally zh-CN or zh-TW; we might use the common character tables from Taiwan and China to estimate it.
Of course, we could convert all characters back to zh-TW or zh-CN at once, but sentence structure and phrasing differ across the regions. Although people can understand and read them, they may hesitate when reading, or read with a "foreign" accent (which I feel is a natural reaction).
So I still feel it's better to manually review and make sure the phrasing and grammar are properly localized.
@irvin While the core of your post ("there are too many characters") is sound, it's not relevant to what we are doing. We are going to output to a 256-dimensional softmax using a byte-level UTF-8 encoding.
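A quick illustration of why a 256-way softmax is enough: UTF-8 encodes every character, Chinese included, as a sequence of 1-4 bytes, each with a value in 0-255, and the encoding round-trips losslessly.

```python
# UTF-8 turns any character into bytes in range 0-255, so a 256-way
# softmax over byte values covers any script, including Chinese.
text = "中文"
data = text.encode("utf-8")
print(list(data))            # [228, 184, 173, 230, 150, 135] -- 3 bytes per character here
print(data.decode("utf-8"))  # lossless round-trip back to 中文
```

The trade-off is that the model must learn to emit valid multi-byte sequences, but the output layer stays the same size for every language.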
Phonemes are nice but they are language dependent, a single phoneme is spoken slightly differently in different languages. Also, different languages use different numbers of phonemes, so the model architecture would have to change from language to language.
Also, using phonemes requires that we have a bi-directional phonetic dictionary for each language we tackle. For German or English this is not a big problem. However, for smaller languages this would exclude them from ever having a speech-to-text engine, as we, or someone, would have to find a linguist to write a phonetic dictionary for the language. Then, as the dictionary would be incomplete (think place names), we'd have to train a phoneme-to-grapheme model with the dictionary to deal with words not covered in it. The phoneme-to-grapheme model itself would be another model which has to be created, trained, maintained, deployed, debugged…
These problems are handled with just audio + transcripts in our approach.
Agreed. I'm thinking only about the problem of "How can we get a good enough Chinese STT model, and the amount of data we need, by this year", not considering other languages.
If you are talking about the output of the model, then yes, it's OK to output text in one variant and convert it to another. There is no spelling problem such as color vs. colour in this scenario.
At the character level, each zh-TW character maps to exactly one zh-CN character, but one zh-CN character may map to many zh-TW characters. However, we do have dictionaries to fix this kind of problem.
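The one-to-many direction is the tricky part. A well-known example: simplified 发 corresponds to both traditional 發 ("to emit") and 髮 ("hair"). Inverting a (tiny, illustrative) traditional-to-simplified table shows why Simplified-to-Traditional conversion needs a dictionary or context to choose:

```python
# Traditional -> Simplified is many-to-one, so the reverse mapping is
# one-to-many and needs context or a dictionary to resolve.
TRAD_TO_SIMP = {"發": "发", "髮": "发"}  # 發 "to emit", 髮 "hair"

SIMP_TO_TRAD = {}
for trad, simp in TRAD_TO_SIMP.items():
    SIMP_TO_TRAD.setdefault(simp, []).append(trad)

print(SIMP_TO_TRAD["发"])  # two candidates; context must pick one
```

This is why converting model output from zh-CN text to zh-TW cannot be a pure character-for-character substitution.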
Yes, this is the project we talked about, which @rosana was leading, plus having the community contribute as well, so we can cover the voice-diversity percentage we need.