I can't speak sentences in Portuguese. There are no phrases for the language

The website says it has no phrases for the language, and Saverio Morelli’s application reports that the API has no phrases either.

Why doesn’t Mozilla pull public domain phrases automatically to solve this annoying problem? :rage:

Hey @maria2, welcome. I feel your pain. The most problematic and “costly” (read: resource-intensive) part of dataset creation is building a good text corpus.

I cannot see an easy way of creating a text corpus like you propose - for 100+ languages. Here are some reasons:

  • It must be public domain / CC-0 and this must be checked by humans.
  • Sentences must fit the rules set on the system (length, number of words, etc.) and must have correct spelling and grammar. This needs human intervention (some NLP scripting helps here).
  • The text corpus also defines the domain of the AI model, and thus the domain that inference will work on. It is best to select non-domain-specific sentences for a general-purpose model, which can later be fine-tuned into a domain-specific one (e.g. medicine, law, etc.). So you must be selective about the vocabulary.
  • Conversational text is preferred, because we will mostly converse with each other and with machines, using natural intonation. So the community should prefer such sentences.
  • Many public domain sources will be old books (e.g. from authors who died more than 70 years ago), and most of the time they will contain old vocabulary and grammatical constructs that are no longer used. If included, these will make your models perform worse. The same goes for religious texts, for example.
  • etc etc
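
To give a rough idea of the scripted part of these checks, here is a minimal Python sketch. The limits (`MAX_WORDS`, `MAX_CHARS`) and the function name `passes_basic_rules` are assumptions made up for illustration, not the actual Common Voice rules, which differ per language.

```python
import re

# Hypothetical limits, for illustration only; real rules differ per language.
MAX_WORDS = 14
MAX_CHARS = 100

def passes_basic_rules(sentence: str) -> bool:
    """Return True if the sentence passes a few purely mechanical checks."""
    text = sentence.strip()
    if not text or len(text) > MAX_CHARS:
        return False
    if len(text.split()) > MAX_WORDS:
        return False
    # Digits can be read aloud in several ways, so reject them here.
    if re.search(r"\d", text):
        return False
    # Reject likely acronyms ("NASA") and dotted abbreviations ("e.g.").
    if re.search(r"\b[A-Z]{2,}\b|\b(?:[A-Za-z]\.){2,}", text):
        return False
    return True

candidates = [
    "Where is the nearest pharmacy?",
    "The meeting is at 10 o'clock.",
    "NASA launched a new probe.",
]
accepted = [s for s in candidates if passes_basic_rules(s)]
print(accepted)
```

A script like this only removes the obvious rejects. Licensing, spelling, grammar, and domain still have to be checked by people.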

Therefore it is the job of the communities/individuals to create the text corpora.

Actually, there are some methods for creating community-driven conversational sentences. For example, we use the following one: open a shared spreadsheet (e.g. in Google Sheets) for each topic (at the hospital, at home, going shopping, etc.), share it with your community, and let people chat in it with questions and answers.


The problem is, I don’t know if I’ve finished my sentences or if I’ve been restricted. It’s been months despite adding new phrases.

This greatly hinders the collection of voices. It should not be so difficult to automatically pull sentences from PDFs in the public domain, to at least not run out.

And even if there are dated words, is it not possible to filter? Or reuse the pronunciation of syllables in some way? Perhaps it is not very useful today, but who can guarantee that this old vocabulary will not be useful in the future, through its syllables for example?

Oh, I see. You are right. Recently the old Sentence Collector application was partially moved into the main Common Voice application, and that caused a series of problems. There are quite a few issues posted on GitHub, and there is a lot of work to be done to regain the old functionality.

You can now post sentences one at a time and they can be validated, but for now they will not be shown for recording. This recent PR should enable them for recording once it is merged.

There is also a bulk sentence posting method via a PR, but it does not work at the moment either; we are all waiting for these additions. Also, Common Voice is currently changing its backbone (AWS => Google Cloud), so bulk imports are disabled for now for that reason as well.

I’d suggest adding sentences either way; they will be available when the transition is complete.

And even if there are dated words, is it not possible to filter?

Well, adding some might be considered. In Turkish we have many words from the Ottoman era, with roots in Arabic and Persian, and I included some of them. Although younger people do not use them in daily life, many older people use them. But it needs careful selection and better language knowledge.

Dated words might cause problems in other languages too; for an example, read this.
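
On the filtering question, the mechanical part could look like the minimal Python sketch below. It does not decide anything by itself; it only sets aside sentences containing words from a curated list so that a fluent speaker can review them. The word list `DATED_WORDS` and the function names are hypothetical and use English only for illustration.

```python
import re

# Hypothetical list; in practice a fluent speaker would curate it per language.
DATED_WORDS = {"thou", "thee", "hath", "whilst"}

def dated_words_in(sentence: str) -> set[str]:
    """Return the dated words that occur in the sentence."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return set(tokens) & DATED_WORDS

def split_for_review(sentences):
    """Split sentences into (keep, review) lists."""
    keep, review = [], []
    for s in sentences:
        (review if dated_words_in(s) else keep).append(s)
    return keep, review

keep, review = split_for_review(["Thou hast my word.", "You have my word."])
print(keep)
print(review)
```

Building and maintaining the word list is the hard part, and it is exactly where the language knowledge comes in.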

It should not be so difficult to automatically pull sentences from PDFs in the public domain, to at least not run out.

I have a bunch of scripts doing this for Turkish resources, but they are language-specific, e.g. converting numbers to text, correcting misspellings, expanding common abbreviations, etc. Some examples can be found in the replace rules among my cv-sentence-extractor rules.
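
To illustrate what such replacements do, here is a minimal Python sketch. It is not my actual rule set and not the cv-sentence-extractor format; the tables `ABBREVIATIONS` and `SMALL_NUMBERS` and the function names are hypothetical, and they use English only to keep the example readable.

```python
import re

# Hypothetical English rules, for illustration; real rules are language-specific.
ABBREVIATIONS = {"Dr.": "Doctor", "Prof.": "Professor"}
SMALL_NUMBERS = ["zero", "one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine", "ten"]

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its full form."""
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(r"\b" + re.escape(abbr), full, text)
    return text

def spell_small_numbers(text: str) -> str:
    """Spell out standalone numbers from 0 to 10; leave larger ones as they are."""
    def repl(match: re.Match) -> str:
        n = int(match.group())
        return SMALL_NUMBERS[n] if n < len(SMALL_NUMBERS) else match.group()
    return re.sub(r"\b\d+\b", repl, text)

def normalize(text: str) -> str:
    return spell_small_numbers(expand_abbreviations(text))

print(normalize("Dr. Silva saw 3 patients."))
print(normalize("He paid 250 euros."))
```

Numbers outside the table are deliberately left unchanged, so a later check can still catch and reject those sentences instead of guessing how they should be read.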

After the extraction, I read each sentence (twice), replacing dated words with their common modern equivalents where needed, and then give the sentences to the community to find my mistakes.

Without such human intervention, one can easily destroy the dataset quality.
