This information is now also available on the About pages of the Common Voice website. Please help us localise it by joining Pontoon. Mozilla Voice Community Playbook: the source of truth for setting up and maintaining self-sustainable communities. Hello ever…

That’s awesome to hear! Atm I still have to run a script & redeploy to make a language available. I’ll try to get that done this week.

Greetings! May I translate this announcement into Interlingua? Thanks. You can reply in English, Italian, Spanish, French or Portuguese. Thanks.

Since the bn-Wikipedia scraping is on hold because of the rust-punkt issue, I would like to know about the legality of using sentences from other public domain sources, mainly for the Bengali language, such as: OSCAR Open Source Project on Multilingual Resources for Mach…

OSCAR was actually already suggested in the past, but for legal reasons a decision has been made not to utilise it. See Using OSCAR corpus sentences . I would have to invest a bit more time into checking the second source you provided to be able to provide some valid input regarding that, but at fir…

Yeah, it’s an already completed dataset. Can’t it be used as a test data set?

Sure it can, but it should not become part of Common Voice. You can link to it from Common Voice, but including the data in Common Voice would only cause confusion and potentially data duplication for multi-dataset applications.

@Adrijaned Is Wikisource a good source? I see that public domain means different things in different countries. Can you clarify whether text that is public domain in the US can be used for Common Voice? This could be a good source: https://bn.wikisource.org/wiki/লেখক:রবীন্দ্রনাথ_ঠাকুর but adding them manually throug…

WikiSource can be a good source, and I agree that manually pulling sentences from there is tedious (been there, done that). If you could make a script to automatically extract viable sentences from WikiSource, then those could in theory go through the same validation process as Wikipedia extractions…

As you can see from the discussions here, the issue is that the rust-punkt tokenizer is not even trained for languages like Bengali; it falls back to English for tokenizing, as explained by @Arijit_Mukherjee before. Until and unless a new tokenizer is introduced with wider support Wikipedia extraction…
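
To illustrate why an English fallback is a poor fit for Bengali: Bengali sentences normally end with the danda (।, U+0964) rather than a full stop. The snippet below is only a minimal sketch of danda-aware splitting, not how rust-punkt or the Common Voice extractor works. The function name `split_bengali_sentences` is hypothetical, and the rule (split after ।, ? or ! when followed by whitespace) is an assumption that ignores abbreviations, quotations and other edge cases.

```python
import re

# Split on whitespace that directly follows a sentence-ending mark:
# the danda (U+0964), "?" or "!". This is an illustrative rule only.
_SENTENCE_END = re.compile(r"(?<=[\u0964?!])\s+")


def split_bengali_sentences(text: str) -> list[str]:
    """Return the sentences in `text`, keeping their end punctuation."""
    parts = _SENTENCE_END.split(text.strip())
    # Drop empty fragments that can appear with irregular spacing.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = "আমি ভাত খাই। তুমি কি আসবে? হ্যাঁ!"
    for sentence in split_bengali_sentences(sample):
        print(sentence)
```

Any sentences produced this way would still need the usual review and validation before being submitted.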

📖 Readme: How to see my language on Common Voice

Patsun (Patsun) January 26, 2021, 10:08pm 34

Did all of that back in 2018 and still waiting to get a response from the moderators.
