Thai (th) rules for validation and cleanup, for discussion

bact · August 30, 2020, 7:59pm

I’ve added validation (filter) and cleanup (normalise) rules for the Thai language (th - ภาษาไทย). Comments are welcome.

Validation rules https://github.com/Common-Voice/sentence-collector/issues/318
Cleanup rules https://github.com/Common-Voice/sentence-collector/issues/324

There are some areas that need discussion.

For example, should we reject a sentence whose characters are in the wrong logical order, even though it may still render correctly on screen? Or, when the fix is obvious, could we accept it first (let it pass the filter) and clean it up later with a normalize function? (This question applies to other languages as well.)

Also note that the cleanup function is currently used only by the exporter, and the call order is validator first, followed by cleanup.
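
To make the ordering concern concrete, here is a minimal Python sketch (the Sentence Collector itself is not written in Python, and `export_sentences` and the toy whitespace rules below are hypothetical, not the real Thai rules). It only illustrates that when validation runs before cleanup, a sentence with an easily fixable problem never reaches the cleanup step unless the validator lets it through.

```python
from typing import Callable, Iterable, List


def export_sentences(
    sentences: Iterable[str],
    validate: Callable[[str], bool],
    cleanup: Callable[[str], str],
) -> List[str]:
    """Validate first, then clean up only the sentences that passed."""
    return [cleanup(s) for s in sentences if validate(s)]


# Toy rules for illustration only (assumptions, not the real Thai rules).
def strict_validate(s: str) -> bool:
    # Rejects anything with stray leading/trailing whitespace.
    return s == s.strip()


def lenient_validate(s: str) -> bool:
    # Accepts any non-empty sentence and leaves the fix to cleanup.
    return len(s.strip()) > 0


def strip_cleanup(s: str) -> str:
    return s.strip()


raw = ["สวัสดี", " สวัสดี "]
strict_result = export_sentences(raw, strict_validate, strip_cleanup)
lenient_result = export_sentences(raw, lenient_validate, strip_cleanup)
```

With the strict validator the second sentence is dropped even though cleanup could have repaired it; with the lenient one both sentences survive and come out identical.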

bact · August 30, 2020, 10:58pm

For example, the current validator will reject this sequence

เ + เ (Sara E \u0E40 + Sara E \u0E40)
as it is a wrong way to write:
แ (Sara Ae \u0E41)

But it is also possible to let the sequence \u0E40\u0E40 pass the validator and later convert it to \u0E41 in cleanup, as this pattern is unambiguous and has no other interpretation in Thai.
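
The two options can be sketched in Python as follows. This is an illustration under assumptions, not the code from the linked issues: the names `is_valid_strict` and `cleanup` and the regular expression are hypothetical.

```python
import re

SARA_E = "\u0E40"   # เ
SARA_AE = "\u0E41"  # แ

# Hypothetical pattern: two consecutive Sara E characters.
DOUBLE_SARA_E = re.compile(SARA_E + SARA_E)


def is_valid_strict(sentence: str) -> bool:
    """Option 1: reject any sentence containing Sara E + Sara E."""
    return DOUBLE_SARA_E.search(sentence) is None


def cleanup(sentence: str) -> str:
    """Option 2: accept the sentence and rewrite Sara E + Sara E as Sara Ae."""
    return DOUBLE_SARA_E.sub(SARA_AE, sentence)


wrong = "\u0E40\u0E40\u0E21\u0E27"  # "cat" typed with two Sara E
right = "\u0E41\u0E21\u0E27"        # "cat" typed with one Sara Ae (แมว)
fixed = cleanup(wrong)
```

Here `is_valid_strict(wrong)` is false, while `cleanup(wrong)` yields the same string as `right`, so the sentence is recoverable if the validator lets it pass.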

Comments are very welcome. Thank you.

bact · April 16, 2021, 6:09am

Update: The rules have been merged and have been running in the Sentence Collector for a while.

More work on Thai is now going into the Sentence Extractor, with many rules borrowed from those in the Sentence Collector.

See the initial PR here:

It currently doesn’t work very well for a language written without a period at the end of the sentence, so it yields a very low number of sentences. But the quality is quite good overall: 88% of the sentences are OK according to a native speaker’s review.