Spanish dataset


(Mar Martinez) #1


I would like to help start collecting Spanish voices. How can I do it?
How could I contribute, for example, with sentences (0/5000)?

(Carlos Fonseca) #2

I was about to ask the same. How do we enable Spanish in the portal? I can contribute too :slight_smile:

(Carlos Fonseca) #3

(Mar Martinez) #4

Step 3, the one I am concerned about, is… still blocked?

How did the other languages already in use manage this? Are there alternatives?

(Rubén Martín) #5


We are still finishing the sentence collection tool:

Ideally we would have a beta version to test before the end of the year. If the QA of that beta is satisfactory, we can start using it to collect and review sentences.

I understand it’s frustrating, but please keep collecting sentences so we can submit them through the tool as soon as it’s ready.

(Fatimaig) #6

And where do we contribute Spanish sentences?
Some colleagues from the university and I would like to collaborate on the Spanish part of this project.

(Rubén Martín) #7

You can start collecting public domain sentences anytime and store them in any format in the meantime. As soon as the tool is ready you will be able to submit them for peer review and approval.

Please note that we want the sentence collection tool to enforce some hard requirements that are necessary for sentences to be useful for the machine learning algorithm:

  • Numbers. There should be no digits in the source text because they can cause problems when read aloud. The way a number is read depends on context and might introduce confusion in the dataset. For example, the number “2409” could be accurately read as both “twenty-four zero nine” and “two thousand four hundred nine”.
  • Abbreviations and Acronyms. Abbreviations and acronyms like “USA” or “ICE” should be avoided in the source text because they may be read in a way that does not coincide with their spelling. Additionally, there may be multiple accurate readings for a single abbreviation. For example, the acronym “ICE” could be pronounced “I-C-E” or as a single word.
  • Punctuation. Special symbols and punctuation should only be included when absolutely necessary. For example, an apostrophe is included in English words like “don’t” and “we’re” and should be included in the source text, but it’s unlikely you’ll ever need a special symbol like “@” or “#.”
  • Foreign letters. Letters must be valid in the language being spoken. For example, “ж” is a letter in the Russian alphabet but is never used in English and so should never appear in any English source text.
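As a rough illustration of the four requirements above, here is a minimal Python sketch for pre-checking Spanish sentences before submitting them. It is not the sentence collection tool's actual logic: the function name `check_sentence`, the allowed letter and punctuation sets, and the "two or more capitals in a row" acronym heuristic are all assumptions made for this example.

```python
import re
from typing import List

# Assumption: letters considered valid in Spanish text (lowercase forms).
SPANISH_LETTERS = set("abcdefghijklmnopqrstuvwxyzáéíóúüñ")
# Assumption: punctuation tolerated in a sentence; everything else is flagged.
ALLOWED_PUNCTUATION = set(" .,;:¿?¡!'\"-")
# Heuristic: a standalone run of two or more capitals looks like an acronym.
ACRONYM_RE = re.compile(r"\b[A-ZÁÉÍÓÚÜÑ]{2,}\b")


def check_sentence(sentence: str) -> List[str]:
    """Return the list of problems found in a sentence (empty if none)."""
    problems: List[str] = []

    # Numbers: no digits, since their reading depends on context.
    if any(ch.isdigit() for ch in sentence):
        problems.append("digits")

    # Abbreviations and acronyms: flag words written fully in capitals.
    if ACRONYM_RE.search(sentence):
        problems.append("acronym")

    # Foreign letters and special symbols: check character by character.
    for ch in sentence:
        if ch.isdigit():
            continue  # already reported above
        if ch.isalpha():
            if ch.lower() not in SPANISH_LETTERS and "foreign letter" not in problems:
                problems.append("foreign letter")
        elif ch not in ALLOWED_PUNCTUATION and "symbol" not in problems:
            problems.append("symbol")

    return problems


if __name__ == "__main__":
    for text in ["¿Dónde está la niña?", "Tengo 2409 libros.", "Vive en USA."]:
        print(text, check_sentence(text))
```

A check like this only catches the mechanical problems. Incomplete or archaic sentences still need a human reviewer.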

(Mar Martinez) #8


Any update about the sentence collection tool availability for Spanish?


(Rubén Martín) #9

We plan to launch the beta version of the tool next week.

You can follow the development in the “Sentence collection tool development” topic.

(Mar Martinez) #10


The sentence collection tool is ready and now I am adding and reviewing sentences in Spanish. Great.

But the Spanish localization of the Common Voice site never seems to finish, and it is actually getting worse (from 95% completion in December to 75% now). Where can I fix this? Directly with a GitHub pull request, or with some other tool?


(Rubén Martín) #11

Website localization is handled via Pontoon:

You can ping people with reviewer rights in this Telegram channel or this forum.

(Mar Martinez) #12

On the other hand, I assume that filling in the accent list requires a GitHub pull request. Should the list be roughly by country (e.g. Español de México, Español de Guatemala, Español de España…) or include local accents within each country?


(Rubén Martín) #13

We should probably follow an official list of accents. @josh_meyer how did we get the English one?


(Rubén Martín) #15

@mar_martinez @carlfm01 @fatimaig I see a lot of activity for Spanish today :smiley:

I just noticed an old book with odd, archaic language and some subtitles with incomplete sentences. I hope the community votes them down, but we should probably warn people not to upload thousands of sentences without even checking them.

I’ve added a few tips on how to get a lot of valid sentences by reusing previous Catalan work here