I’ve added validation (filter) and cleanup (normalise) rules for the Thai language (th - ภาษาไทย). Comments are welcome.
- Validation rules https://github.com/Common-Voice/sentence-collector/issues/318
- Cleanup rules https://github.com/Common-Voice/sentence-collector/issues/324
There are some areas that need discussion.
For example, should we reject a sentence whose characters are in the wrong logical order, even though it may still render correctly on screen? Or, if the fix is obvious, could we accept it first (let it pass the filter) and clean it up later with a normalize function? (This question applies to other languages as well.)
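To make the question concrete, here is a minimal sketch of one such case: a Thai tone mark typed before an above/below vowel instead of after it. It is written in Python purely for illustration (it is not the sentence-collector code), the names `has_misordered_marks` and `normalize_order` are hypothetical, and it covers only this single ordering pattern.

```python
import re

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..U+0E4B).
TONE_MARKS = "\u0e48-\u0e4b"
# Vowels written above or below the consonant:
# mai han-akat (U+0E31) and sara i .. sara uu (U+0E34..U+0E39).
ABOVE_BELOW_VOWELS = "\u0e31\u0e34-\u0e39"

# A tone mark immediately followed by an above/below vowel is in the
# wrong logical order; the vowel is expected to come first.
MISORDERED = re.compile(f"([{TONE_MARKS}])([{ABOVE_BELOW_VOWELS}])")


def has_misordered_marks(sentence: str) -> bool:
    """Strict option: a filter could reject the sentence when this is True."""
    return MISORDERED.search(sentence) is not None


def normalize_order(sentence: str) -> str:
    """Lenient option: swap each misordered pair so the vowel precedes the tone mark."""
    return MISORDERED.sub(r"\2\1", sentence)


wrong = "\u0e01\u0e48\u0e34"    # ko kai + mai ek + sara i (tone mark first)
correct = "\u0e01\u0e34\u0e48"  # ko kai + sara i + mai ek

assert has_misordered_marks(wrong)
assert normalize_order(wrong) == correct
```

With the strict option the sentence is lost at the filter; with the lenient option it passes and is repaired later, which only works if the filter does not already reject this pattern.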
Also note that the cleanup function is currently used only by the exporter, and the call order is validator first, then cleanup.
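The consequence of that call order can be sketched as follows. This is a simplified Python illustration of "validate first, then clean up", not the actual exporter; `export_sentences` and the toy `validate`/`cleanup` rules are assumptions made up for the example.

```python
from typing import Callable, Iterable, List


def export_sentences(
    sentences: Iterable[str],
    validate: Callable[[str], bool],
    cleanup: Callable[[str], str],
) -> List[str]:
    """Run the validator first; only sentences that pass are cleaned up."""
    exported = []
    for sentence in sentences:
        if not validate(sentence):
            # Rejected here, so cleanup never gets a chance to repair it.
            continue
        exported.append(cleanup(sentence))
    return exported


# Toy rules for illustration only: reject digits, collapse repeated whitespace.
def no_digits(sentence: str) -> bool:
    return not any(ch.isdigit() for ch in sentence)


def collapse_spaces(sentence: str) -> str:
    return " ".join(sentence.split())


result = export_sentences(["hello  world", "room 101"], no_digits, collapse_spaces)
assert result == ["hello world"]
```

So anything we want the cleanup step to fix has to be tolerated by the validator; a problem the validator rejects can never be repaired downstream.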