I have received permission from Kapook.com to extract sentences from their text for use in Mozilla Common Voice.

The entire text is here: https://gist.github.com/mishari/4308887f2f8bc1cd6d7c6e07a68d08f4 with 60k entries

The Thai community has created a sample based on the recommendations and checked it; the results are here:

An unusually high number of sentences are marked grammatically incorrect. This stems from a feature of written Thai: spaces do not separate words but instead act as a combination of comma and period, so segmenting text into sentences is an ongoing challenge. The text itself is mostly readable, with words in the correct order relative to one another, but some sentences may seem incomplete, which is why they’ve been marked grammatically incorrect.

I would like to start a discussion about how this corpus can be bulk loaded into Common Voice.

Hi @mishari,

Could you clarify whether the text from Kapook.com is in the public domain? We may need to review the agreement with the legal team, as Common Voice is licensed under CC0.

For bulk submissions, please see this guide, which explains the process: https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission

For manual QA, we’re looking for an error rate below 5% on the random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review.
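
For anyone who wants to sanity-check the sample size by hand, here is a minimal sketch in Python. It assumes the standard Cochran formula with a finite population correction and a worst-case proportion of 0.5; the linked tool may compute it differently, and the function names `sample_size` and `error_rate` are made up for this illustration. The 60,000 figure is the entry count mentioned earlier in the thread and should be replaced with the number of sentences actually submitted.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.99, margin=0.02, proportion=0.5):
    """Estimate how many sentences to review (Cochran's formula, assumed here).

    proportion=0.5 is the most conservative choice: it yields the largest sample.
    """
    # Two-sided z-score for the requested confidence level (about 2.576 for 99%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # Finite population correction, since the corpus has a known size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def error_rate(flagged, reviewed):
    """Percentage of reviewed sentences that were flagged as invalid."""
    return 100.0 * flagged / reviewed


if __name__ == "__main__":
    corpus_size = 60_000  # assumption: entry count quoted earlier in the thread
    n = sample_size(corpus_size)
    print(f"Review about {n} randomly chosen sentences")
    # Hypothetical review outcome, for illustration only.
    print("Passes the 5% threshold:", error_rate(150, n) < 5.0)
```

The result depends on the corpus size only weakly once the corpus is large, so a modest change in the number of extracted sentences should not change the review workload much.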

Once the review is complete, submit a pull request stating the number of sentences submitted, a link to the manual QA results, and the error rate as a percentage. Here’s an example PR.

Hi Hillary,

The founder of Kapook has personally authorised the use of the sentences in the Mozilla Common Voice project, and the extracted sentences are licensed under CC0.

I made a pull request but have temporarily withdrawn it so that I can improve the extraction technique and its output. I will resubmit it later.

