Spanish dataset


(Mar Martinez) #1


I would like to help start collecting Spanish voices. How can I do that?
How could I contribute with sentences, for example (0/5000)?

Best regards,

(Carlos Fonseca) #2

I was about to ask the same thing. How do we enable Spanish in the portal? I can contribute too :slight_smile:

(Carlos Fonseca) #3

(Mar Martinez) #4

Yes, thanks Carlos.
Step 3, the one I am concerned about, is… still blocked?

How were the other languages already in use handled? Are there alternatives?

(Rubén Martín) #5


We are still finishing the sentence collection tool:

Ideally we would have a beta version to test before the end of the year. If the QA of that beta is satisfactory, we can start using it to collect and review sentences.

I understand it’s frustrating, but please keep collecting sentences so we can submit them through the tool as soon as it’s ready.

Thank you for your patience :wink:

(Fatimaig) #6

And where do we contribute Spanish sentences?
Some colleagues from the University and I would like to collaborate on the Spanish part of this project.
Thank you :wink:

(Rubén Martín) #7

You can start collecting sentences under a public domain license anytime and store them in any format in the meantime. As soon as the tool is ready, you will be able to submit them for peer review and approval.

Please note that we want the sentence collection tool to enforce some hard requirements that are necessary for sentences to be useful for the machine learning algorithm:

  • Numbers. There should be no digits in the source text because they can cause problems when read aloud. The way a number is read depends on context and might introduce confusion in the dataset. For example, the number “2409” could be accurately read as both “twenty-four zero nine” and “two thousand four hundred nine”.
  • Abbreviations and Acronyms. Abbreviations and acronyms like “USA” or “ICE” should be avoided in the source text because they may be read in a way that does not coincide with their spelling. Additionally, there may be multiple accurate readings for a single abbreviation. For example, the acronym “ICE” could be pronounced “I-C-E” or as a single word.
  • Punctuation. Special symbols and punctuation should only be included when absolutely necessary. For example, an apostrophe is included in English words like “don’t” and “we’re” and should be included in the source text, but it’s unlikely you’ll ever need a special symbol like “@” or “#.”
  • Foreign letters. Letters must be valid in the language being spoken. For example, “ж” is a letter in the Russian alphabet but is never used in English and so should never appear in any English source text.
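
While the tool is not available, the four requirements above can be checked locally before submitting anything. The following is only a minimal sketch in Python, not the sentence collection tool's actual logic: the function name `check_sentence`, the Spanish letter set, the allowed punctuation, and the "two or more consecutive capitals" acronym heuristic are all assumptions made for illustration.

```python
import re

# Assumption: letters considered valid for Spanish source text.
ALLOWED_LETTERS = set("abcdefghijklmnopqrstuvwxyzáéíóúüñ")
# Assumption: punctuation tolerated in a sentence; adjust as needed.
ALLOWED_PUNCTUATION = set(" .,;:¿?¡!'’\"-")
# Heuristic only: a standalone word of two or more capitals looks like an acronym.
ACRONYM_RE = re.compile(r"\b[A-ZÁÉÍÓÚÜÑ]{2,}\b")


def check_sentence(sentence):
    """Return a list of problem names found in the sentence (empty if none)."""
    problems = []
    if any(ch.isdigit() for ch in sentence):
        problems.append("digits")
    if ACRONYM_RE.search(sentence):
        problems.append("acronym")
    # Characters that are neither digits, allowed letters, nor allowed punctuation.
    others = [
        ch for ch in sentence
        if not ch.isdigit()
        and ch.lower() not in ALLOWED_LETTERS
        and ch not in ALLOWED_PUNCTUATION
    ]
    if any(ch.isalpha() for ch in others):
        problems.append("foreign_letter")
    if any(not ch.isalpha() for ch in others):
        problems.append("symbol")
    return problems


if __name__ == "__main__":
    for text in ["El gato duerme en el sofá.", "Tengo 2409 libros.", "Vivo en USA."]:
        print(text, check_sentence(text))
```

The acronym check is deliberately crude: it would also flag a sentence written entirely in capitals, and it cannot catch lowercase abbreviations, so flagged sentences would still need a human look.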