Kapook.com Thai language sentence submission for inclusion

mishari · June 21, 2021, 10:30am

Hi,

I have received permission from Kapook.com to extract sentences from their text for use in Mozilla Common Voice.

The entire text is here: https://gist.github.com/mishari/4308887f2f8bc1cd6d7c6e07a68d08f4 with 60k entries

The Thai community has created a sample based on the recommendations and has checked it; the results are here:

An unusually high number of sentences are grammatically incorrect. This comes from a challenge specific to the Thai language: it doesn’t put spaces between words; instead, a space is used as a combination of comma and period, so segmenting text into sentences is an ongoing challenge. The text itself is mostly readable, with words in the correct order relative to one another, but one may find the sentences somewhat incomplete, which is why they’ve been marked grammatically incorrect.

I would like to start a discussion on how this corpus can be bulk loaded into Common Voice.

cc @bact

Best regards
Mishari

heyhillary · June 23, 2021, 3:26pm

Hi @mishari,

Could you clarify whether the text from Kapook.com is in the public domain? We may need to review the agreement with the legal team, as Common Voice is licensed under CC-0.

Regarding bulk submission, please check out this guide, which explains how to do bulk submissions: https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission

For manual QA, we’re looking for an error rate of less than 5% on the random sample. You can use this tool with a confidence level of 99% and a margin of error of 2% to determine the sample size you need to review.
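For anyone who wants to sanity-check the numbers by hand, here is a rough sketch of the sample-size calculation. It assumes the standard Cochran formula with a finite population correction and a worst-case proportion of 0.5; the tool mentioned above may use a slightly different method, and the function names `sample_size` and `error_rate` are just illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.99, margin=0.02, p=0.5):
    """Estimate how many sentences to review (Cochran's formula, assumed here).

    population: total number of sentences in the corpus
    confidence: confidence level, e.g. 0.99 for 99%
    margin:     margin of error, e.g. 0.02 for 2%
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction, since the corpus size is known.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def error_rate(flagged, reviewed):
    """Fraction of reviewed sentences that were flagged as bad."""
    return flagged / reviewed


if __name__ == "__main__":
    n = sample_size(60_000)
    print(f"Sentences to review: {n}")
    # Hypothetical review outcome, for illustration only.
    rate = error_rate(flagged=100, reviewed=n)
    print(f"Error rate: {rate:.2%} (target: below 5%)")
```

Under these assumptions, a corpus of about 60k entries needs a little under 4,000 reviewed sentences, and the sample passes if fewer than 5% of them are flagged.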

Once the review is complete, submit a pull request with the # of sentences submitted, a link to the manual QA results, and the % error rate. Here’s an example PR.

mishari · July 14, 2021, 1:48pm

Hi Hillary,

The founder of Kapook has personally authorised the use of the sentences in the Mozilla Common Voice project, and the extracted sentences are licensed under the CC0 license.

I’ve made a pull request but have temporarily withdrawn it in order to improve the extraction technique and output. I will resubmit it later.

Best regards
Mishari
