The Thai community has created a sample based on the recommendations and checked it; the results are here:
There’s an unusually high number of sentences marked grammatically incorrect, owing to a peculiarity of the Thai language: it doesn’t put spaces between words, and instead uses a space as a combination of comma and period. Segmenting text into sentences is therefore an ongoing challenge. The text itself is mostly readable, with words in the correct order relative to one another, but some sentences come out somewhat incomplete, which is why they’ve been marked grammatically incorrect.
I would like to start a discussion about how this corpus can be bulk loaded into Common Voice.
Could you clarify whether the text from Kapook.com is in the public domain? We may need to review the agreement with the legal team, as Common Voice is licensed under CC0.
For manual QA we’re looking for an error rate below 5% on the random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review.
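If you’d rather estimate the sample size yourself, here is a rough sketch. It assumes the tool applies the standard Cochran formula with a finite population correction and the most conservative proportion (p = 0.5); that is an assumption on my part, so the tool’s number may differ slightly. The function name `sample_size` and its parameters are made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.99, margin=0.02, p=0.5):
    """Estimate how many sentences to review for manual QA.

    Assumption: Cochran's formula with finite population correction.
    population: total number of sentences in the corpus
    confidence: confidence level (0.99 for 99%)
    margin:     margin of error (0.02 for 2%)
    p:          expected proportion; 0.5 gives the largest sample
    """
    if population <= 0:
        raise ValueError("population must be positive")
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Shrink it for a finite corpus.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical corpus sizes, only to show how the sample scales.
    for total in (1_000, 100_000, 1_000_000):
        print(total, sample_size(total))
```

With these settings the required sample levels off at roughly four thousand sentences for large corpora, so a bigger corpus does not mean proportionally more review work.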
Once the review is complete, submit a pull request with the number of sentences submitted, a link to the manual QA results, and the error rate as a percentage. Here’s an example PR.
The founder of Kapook has personally authorised the use of the sentences in the Mozilla Common Voice project, and the extracted sentences are licensed under CC0.
I’ve made a pull request but have temporarily withdrawn it in order to improve the extraction technique and output. I’ll resubmit it later.