Spanish dataset

mar_martinez · December 14, 2018, 6:20pm

Hi,

I would like to help start collecting Spanish voices. How can I do it?
How could I contribute, for example, with sentences (0/5000)?

Best regards,
Mar

carlfm01 · December 14, 2018, 9:47pm

I was about to ask the same: how do we enable Spanish in the portal? I can contribute too.

carlfm01 · December 15, 2018, 7:23am

mar_martinez · December 15, 2018, 9:35am

Yes, thanks Carlos.
Step 3, the one I am concerned about, is… still blocked?

How did the other languages already in use manage this? Are there alternatives?

nukeador · December 17, 2018, 5:13pm

Hi,

We are still finishing the sentence collection tool:

Ideally we would have a beta version to test before the end of the year. If the QA of that beta is satisfactory, we can start using it to collect and review sentences.

I understand it’s frustrating, but please keep collecting sentences so we can submit them through the tool as soon as it’s ready.

Thank you for your patience.

fatimaig · December 26, 2018, 5:40pm

And where do we contribute Spanish sentences?
Some colleagues from the University and I would like to collaborate on the Spanish part of this project.
Thank you

nukeador · December 26, 2018, 5:45pm

You can start collecting public domain sentences anytime and store them in any form in the meantime. As soon as the tool is ready you will be able to submit them for peer review and approval.

Please note that we want the sentence collection tool to enforce some hard requirements that are necessary for sentences to be useful for the machine learning algorithm:

Numbers. There should be no digits in the source text because they can cause problems when read aloud. The way a number is read depends on context and might introduce confusion in the dataset. For example, the number “2409” could be accurately read as both “twenty-four zero nine” and “two thousand four hundred nine”.
Abbreviations and Acronyms. Abbreviations and acronyms like “USA” or “ICE” should be avoided in the source text because they may be read in a way that does not coincide with their spelling. Additionally, there may be multiple accurate readings for a single abbreviation. For example, the acronym “ICE” could be pronounced “I-C-E” or as a single word.
Punctuation. Special symbols and punctuation should only be included when absolutely necessary. For example, an apostrophe is included in English words like “don’t” and “we’re” and should be included in the source text, but it’s unlikely you’ll ever need a special symbol like “@” or “#.”
Foreign letters. Letters must be valid in the language being spoken. For example, “ж” is a letter in the Russian alphabet but is never used in English and so should never appear in any English source text.
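
As a rough illustration only (this is not the sentence collection tool's actual code), the four requirements above could be pre-checked before submitting with something like the sketch below. The function name `check_sentence`, the set of letters treated as valid for Spanish, and the list of tolerated punctuation are all assumptions made for this example, and the acronym check is just a heuristic that flags two or more consecutive capital letters.

```python
import re

# Assumption: letters treated as valid for Spanish (not an official list).
SPANISH_LETTERS = "abcdefghijklmnopqrstuvwxyzáéíóúüñ"
# Assumption: punctuation tolerated in a sentence; anything else is flagged.
ALLOWED_PUNCTUATION = " .,;:¿?¡!'’\"-"


def check_sentence(sentence):
    """Return a list of rule violations (empty if the sentence passes)."""
    problems = []
    # Rule 1: no digits in the source text.
    if any(ch.isdigit() for ch in sentence):
        problems.append("digits")
    # Rule 2 (heuristic): consecutive capitals look like an acronym.
    if re.search(r"[A-ZÁÉÍÓÚÜÑ]{2,}", sentence):
        problems.append("acronym")
    for ch in sentence:
        if ch.isdigit():
            continue  # already reported by rule 1
        if ch.isalpha():
            # Rule 4: letters must belong to the target language.
            if ch.lower() not in SPANISH_LETTERS and "foreign letter" not in problems:
                problems.append("foreign letter")
        # Rule 3: only basic punctuation, no special symbols.
        elif ch not in ALLOWED_PUNCTUATION and "symbol" not in problems:
            problems.append("symbol")
    return problems
```

A sentence such as "Tengo 2409 libros." would be reported with "digits", while "¿Cómo estás?" would come back with no problems. Human review is still needed, since a script like this cannot judge whether a sentence is natural or complete.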

mar_martinez · January 25, 2019, 12:40pm

Hi,

Any update about the sentence collection tool availability for Spanish?

Thanks,
Mar

nukeador · January 25, 2019, 12:46pm

We plan to launch the beta version of the tool next week.

You can follow the development in the “Sentence collection tool development” topic.

mar_martinez · January 30, 2019, 6:18pm

Hi,

The sentence collection tool is ready and now I am adding and reviewing sentences in Spanish. Great.

However, the Spanish localization of the Common Voice site never seems to be finished, and it is actually getting worse (from 95% completion in December to 75% now). Where can I fix this? Directly with a GitHub pull request, or through some other tool?

Thanks,
Mar

nukeador · January 30, 2019, 6:31pm

Website localization is handled via Pontoon:

You can ping people with reviewer rights in this Telegram channel or this forum.

mar_martinez · January 30, 2019, 7:25pm

Thanks a lot,

On the other hand, to fill in the accent list, I assume a GitHub pull request is required. Should the list be organized roughly by country (e.g. Español de México, Español de Guatemala, Español de España…) or by local accents within each country?

Regards,
Mar

nukeador · January 30, 2019, 8:34pm

We should probably follow an official list of accents. @josh_meyer how did we get the English one?

josh_meyer · January 30, 2019, 10:29pm

I don’t know how we got the English accents, but I’ll ask around.

nukeador · February 1, 2019, 1:02am

@mar_martinez @carlfm01 @fatimaig I see a lot of activity for Spanish today

I just noticed an old book with odd, archaic language and some subtitles with incomplete sentences. I hope the community votes them down, but we should probably warn people before they upload thousands of sentences without even checking them.

I’ve added a few tips here on how to get a lot of valid sentences by reusing previous Catalan work.

daniel.cruzado · April 3, 2019, 7:43am

Hi, I have seen that Spanish already has 14 hours, which is far more than, for example, Breton or Irish, but the Spanish dataset is not available for download.

Do we know when it will be ready?

nukeador · April 3, 2019, 10:19am

Please read this explanation about the current dataset release process:

daniel.cruzado · April 3, 2019, 10:56am

Ok, thanks a lot for your answer and for all of your work!!
