How can I send sentences to contribute?


(Pedro Lima) #1

Hi, I’m looking forward to Portuguese being added and I’d like to contribute. How can I do that?
It seems we need 5000 sentences, which shouldn’t be hard: there is a corpus of 130k Portuguese sentences, developed by a speech research team, that we could use. For informal language we could write our own sentences. I’d also like to see Spanish. I’m not a Spanish speaker, but I think we need to focus on the main languages, since we don’t have an equivalent of LibriVox or TEDx for those languages. Also, do you guys have plans to add Spoken Wikipedia to the datasets page?


(Lissyx) #2

I’m sure @reuben can give some hints on Portuguese :). BTW, please make sure your 130k dataset can be licensed as CC-0, otherwise it cannot be used for Common Voice.

We have a French contributor hacking on tooling for exactly that purpose: extracting CC-0 compatible Wikipedia content whatever the language (right now French, but he’s willing to expand it).


(Pedro Lima) #3

Well, so we can use Wikipedia, right? OK, I’ll look into it. As for informal language, we could write some sentences ourselves to reach 5000 and then get started.


(Lissyx) #4

Wikipedia should not be the only component of your dataset, but it can be a part of it. Please ensure you only extract CC-0 content.


(Lissyx) #5

Here is his code (tested only on French so far): https://github.com/jeanbaptisteb/commonvoice-fr/blob/master/Wikipedia_CC0.py


(Jean-Baptiste Bertrand) #6

As @lissyx said above, there’s a script to extract content from Wikipedia under the CC0 licence. I still need to fix a couple of bugs in it, and maybe make it simpler to use.

I’ll try to check this weekend whether it’s possible to extract content from the Portuguese Wikipedia with this script. But feel free to check it directly if you’re comfortable with Python scripts and have the time!

As for using the script, I currently strongly recommend using the parameter “type=‘creation’” and avoiding the parameter “type=‘all_content’”. If you use the “type=‘all_content’” parameter, the script may retrieve non-CC0 content, which is incompatible with the Common Voice licence. This is a bug I need to fix.
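
To make that recommendation harder to get wrong, a small guard could sit in front of whatever call runs the extraction. This is only a sketch: `check_extraction_type` and `SAFE_TYPES` are hypothetical names that are not part of Wikipedia_CC0.py, and the only values assumed are the two mentioned above.

```python
# Hypothetical guard, not part of Wikipedia_CC0.py.
# Assumption: "creation" is the only extraction type currently known
# to return CC0-only content; anything else is rejected.
SAFE_TYPES = {"creation"}


def check_extraction_type(type_: str) -> str:
    """Return type_ unchanged if it is considered CC0-safe, else raise."""
    if type_ not in SAFE_TYPES:
        raise ValueError(
            f"type={type_!r} may retrieve non-CC0 content; "
            f"use one of {sorted(SAFE_TYPES)} instead"
        )
    return type_


# Example: validate the value before passing it on to the extraction call.
extraction_type = check_extraction_type("creation")
```

Calling `check_extraction_type("all_content")` raises a `ValueError`, so a run with the risky setting stops before any content is fetched.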

If you have any questions about the script, feel free to send me a direct message. I won’t have a lot of time to dedicate to the project in the coming weeks, but I’ll answer your questions as soon as I can.


(Rubén Martín) #7

Please keep up these efforts to collect public-domain sentences. As soon as our sentence collection tool is ready, we should be able to use it to submit, validate and review them so they can be incorporated into the database.


(Rubén Martín) #8

I’ve created this topic to centralize questions