Hi,
I have received permission from Kapook.com to extract sentences from their text for use in Mozilla Common Voice.
The entire text, with 60k entries, is here: https://gist.github.com/mishari/4308887f2f8bc1cd6d7c6e07a68d08f4
The Thai community has created a sample based on the recommendations and has checked it. The results are here:
There's an unusually high number of sentences marked as grammatically incorrect. This comes from a challenge specific to Thai: the language doesn't put spaces between words, and instead uses a space as a combination of comma and period, so segmenting text into sentences is an ongoing problem. The text itself is mostly readable, with words in the correct order relative to one another, but some sentences may read as somewhat incomplete, which is why they've been marked grammatically incorrect.
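To illustrate the segmentation problem, here is a minimal sketch that treats every space as a sentence boundary. It is an illustration only and not the method used to produce the sample; the function name `split_on_spaces` and the example text are made up for this post.

```python
import re


def split_on_spaces(text: str) -> list[str]:
    """Split Thai text at runs of whitespace.

    Each space is treated as a candidate sentence boundary. This is a
    naive assumption: in Thai a space can act like a comma or like a
    period, so some chunks will be fragments rather than full sentences.
    """
    return [chunk for chunk in re.split(r"\s+", text.strip()) if chunk]


# Hypothetical example text, roughly:
# "The weather is nice today. I went to the market, bought fruit."
sample = "วันนี้อากาศดี ฉันไปตลาด ซื้อผลไม้"

for chunk in split_on_spaces(sample):
    print(chunk)
```

The last chunk has no subject of its own because the space before it works like a comma. Read in isolation it looks incomplete, which is the kind of sentence reviewers tend to mark as grammatically incorrect.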
I would like to start a discussion about how this corpus can be bulk-loaded into Common Voice.
cc @bact
Best regards,
Mishari