Thai (th) rules for validation and cleanup, for discussion

I’ve added validation (filter) and cleanup (normalise) rules for the Thai language (th - ภาษาไทย). Comments are welcome.

There are some areas that need discussion.

For example, should we reject a sentence whose characters are in the wrong logical order, even though it may still render correctly on screen? Or, if the fix is obvious, could we accept it first (let it pass the filter) and clean it up later with a normalize function? (This question applies to other languages as well.)

Also note that the cleanup function is currently used only by the exporter, and the validator is called first, followed by cleanup.

For example, the current validator will reject this sequence

  • เ + เ (Sara E \u0E40 + Sara E \u0E40)
    as it is an incorrect way to write:
  • แ (Sara Ae \u0E41)

But it is also possible to let the sequence \u0E40\u0E40 pass the validator and later convert it to \u0E41 in cleanup, as this pattern is obvious and has no other interpretation in Thai.
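
To make the two options concrete, here is a minimal Python sketch. The function names `is_valid_strict` and `cleanup` are hypothetical and only illustrate the idea; they are not the project's actual rules or API.

```python
import re

# Hypothetical names for illustration; not the project's actual rule set.
# Matches Sara E immediately followed by another Sara E.
DOUBLE_SARA_E = re.compile("\u0E40\u0E40")


def is_valid_strict(sentence: str) -> bool:
    """Option 1: the validator rejects any sentence containing Sara E + Sara E."""
    return DOUBLE_SARA_E.search(sentence) is None


def cleanup(sentence: str) -> str:
    """Option 2: the validator lets it pass, and cleanup rewrites
    Sara E + Sara E as a single Sara Ae (U+0E41)."""
    return DOUBLE_SARA_E.sub("\u0E41", sentence)


# The word for "cat", typed with two Sara E instead of one Sara Ae.
sample = "\u0E40\u0E40\u0E21\u0E27"

print(is_valid_strict(sample))                   # strict validator outcome
print(cleanup(sample) == "\u0E41\u0E21\u0E27")   # cleanup yields the canonical spelling
```

With option 1 the sentence is dropped; with option 2 it is kept and exported in its canonical form. Since cleanup currently runs only in the exporter, option 2 would mean the stored sentence still contains the uncorrected sequence until export.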

Comments are very welcome. Thank you.
