Add Portuguese for voice collection

Hi, I’ve validated about 6 thousand sentences in Portuguese in the sentence collection tool, but they haven’t been added. Is there any chance that we might get a Wikipedia import for Portuguese?

The issue we have right now is that we need to fix the current variants we have for Portuguese.

The website was enabled in pt-BR on Pontoon, and there is a technical limitation since the “pt” locale doesn’t exist on Pontoon.

Based on the proposal we shared about languages, it seems we are aiming for a single Portuguese (like the single Spanish, French or English we have today).

Let me see with the dev team if at least we can fix that so the sentences from the collector are added.

As for Wikipedia, we are still formalizing and improving the process. At some point the tool should be ready for communities to create extraction rules we can apply, but we are not there yet.

Thanks. I’ll see what I can do to add more sentences in CC0, maybe translating from the main repo manually. Is there a way in which those sentences would be prioritized?

In general, translating sentences is not encouraged, since it can result in sentences that sound unnatural to native speakers and restrict the diversity of expressions.

Let me see if we can fix the Portuguese locale issue and I’ll let you know.

Okay… Thanks again.

Quick update: We have identified all the changes needed to enable Portuguese; we’ll apply them as soon as we have a slot.

Portuguese (pt) should now be working on the site.

The next steps will be to do final checks on the number of sentences collected and do the proper export to the site.

We have exported the pending 8K+ sentences from the collector to the site. In the coming days it should be enabled for voice collection.

Please read this topic for reference:

We’ll need your help to keep getting a lot more sentences to accommodate more voice hours.

Thanks for your contributions!

@nukeador Hi, thanks for doing this. Unfortunately, a troll submitted some sentences that made it into this set: sentence number 6006 was definitely a troll, and sentence 8563 is very vulgar/offensive. I will keep an eye on it and report more later; meanwhile, could you remove those?

Also, sentences number 8494 and 8497 are very offensive.

Sentences number 8422, 8425 and 8427 should also be removed.

Thanks, these sentences have been removed from the Sentence Collector and will be removed from the voice-web repo by the following Pull Request:

How can I access voice collection data for Portuguese?

Voice collection data is released a few times during the year on this page

Since Portuguese has just been enabled, we’ll have to wait until the next dataset release (we are working on defining a fixed schedule for this).


Any expected date? I’m eagerly waiting to get my hands on a pt_BR dataset to try with DeepSpeech.


Unfortunately, we have just onboarded a new developer to replace the previous one who left, so we don’t yet have the bandwidth to focus on a new dataset release, given other priorities we need to finish first.

We’ll provide more details on our current focus in a project update this week.