How can I send sentences to contribute?

Codigo_Logo_Programacao_e_Inteligencia_Artificial · September 5, 2018, 12:27pm

Hi, I’m looking forward to Portuguese being added and I want to contribute. How can I do that?
It seems we need 5000 sentences, which shouldn’t be hard: there is a corpus of 130k Portuguese sentences, developed by a speech research team, that we could use. For informal language we could write our own sentences. I’d also like to see Spanish. I’m not a Spanish speaker, but I think we need to focus on the main languages, since we don’t have an equivalent of Librivox or TEDx for those languages. Also, do you have plans to add Spoken Wikipedia to the datasets page?

lissyx · September 5, 2018, 12:32pm

I’m sure @reuben can give some hints on Portuguese :). BTW, please make sure your 130k dataset can be licensed as CC-0, otherwise it cannot be used for Common Voice.

We have a French contributor hacking on tooling for exactly that purpose: extracting CC-0 compatible Wikipedia content whatever the language (right now only French, but he’s willing to expand it).

Codigo_Logo_Programacao_e_Inteligencia_Artificial · September 5, 2018, 12:35pm

Well, so we can use Wikipedia, right? OK, I’ll look into it. With regard to informal language, we could write some sentences ourselves to reach 5000 and then get started.

lissyx · September 5, 2018, 12:36pm

Wikipedia should not be the only component of your dataset, but it can be a part of it. Please ensure you only extract CC-0 content.

lissyx · September 5, 2018, 12:36pm

Here is his code (tested only on French so far): https://github.com/jeanbaptisteb/commonvoice-fr/blob/master/Wikipedia_CC0.py

J-b · September 5, 2018, 2:15pm

As @lissyx said above, there’s a script to extract content from Wikipedia under the CC0 licence. I still need to fix a couple of bugs on it, and maybe to make it simpler to use.

I’ll try to check this weekend whether it’s possible to extract content from the Portuguese Wikipedia with this script. But feel free to check it directly if you’re comfortable with Python scripts and have the time!

As for using the script, I currently strongly recommend using the parameter “type=‘creation’” and avoiding the parameter “type=‘all_content’”. With “type=‘all_content’”, the script may retrieve non-CC0 content, which is incompatible with the Common Voice licence. This is a bug I need to fix.
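
A minimal sketch of that recommendation, written as a guard you could run before calling into the script. The function name `check_extraction_type` and the two sets are hypothetical and are not part of Wikipedia_CC0.py; only the two parameter values come from the advice above.

```python
# Hypothetical guard, NOT part of Wikipedia_CC0.py. Names are assumptions
# made for illustration; only the two parameter values come from the thread.
SAFE_TYPES = {"creation"}             # value recommended in this thread
KNOWN_UNSAFE_TYPES = {"all_content"}  # may return non-CC0 text (known bug)


def check_extraction_type(type_: str) -> str:
    """Return type_ unchanged if it is the recommended value, else raise."""
    if type_ in KNOWN_UNSAFE_TYPES:
        raise ValueError(
            f"type={type_!r} may retrieve non-CC0 content; use type='creation'"
        )
    if type_ not in SAFE_TYPES:
        raise ValueError(f"unrecognised type={type_!r}; use type='creation'")
    return type_


if __name__ == "__main__":
    # Passes silently for the recommended value.
    print(check_extraction_type("creation"))
```

Failing loudly on “all_content” is a deliberate choice here: non-CC0 sentences mixed into a batch are hard to spot afterwards.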

If you have any questions about the script, feel free to send me a direct message. I won’t have a lot of time to dedicate to the project in the coming weeks, but I’ll answer your questions as soon as I can.

nukeador · September 5, 2018, 3:18pm

Please keep up these efforts to collect public domain sentences. As soon as our sentence collection tool is ready, we should be able to use it to submit, validate and review them so they can be incorporated into the database.

nukeador · September 5, 2018, 3:28pm

I’ve created this topic to centralize questions

