The website says it has no phrases for the language, and Saverio Morelli’s application also reports that the API has no phrases.
Why doesn’t Mozilla pull public domain phrases automatically to solve this annoying problem?
Hey @maria2, welcome. I feel your pain. The most problematic and “costly” (read: resource-intensive) part of dataset creation is building a good text corpus.
I cannot see an easy way of creating a text corpus like you propose - for 100+ languages. Here are some reasons:
Therefore it is the job of the communities/individuals to create the text corpora.
Actually, there are some methods for creating community-driven conversational sentences. For example, we use the following method: open a shared Google Docs spreadsheet for each topic (at the hospital, at home, going shopping, etc.), share it with your community, and let people chat in it with questions and answers.
The problem is, I don’t know whether I’ve run out of sentences or whether I’ve been restricted. It’s been like this for months, even though I keep adding new phrases.
This greatly hinders the collection of voices. It should not be so difficult to automatically pull sentences from PDFs in the public domain, at least so that we do not run out.
And even if there are dated words, is it not possible to filter? Or to reuse the pronunciation of the syllables in some way? Perhaps it is not very useful today, but who can guarantee that this old vocabulary will not be useful in the future? Through the syllables, for example.
Oh, I see. You are right. Recently the old Sentence Collector application was partially moved into the main Common Voice application, and that caused a series of problems. There are quite a few issues posted on GitHub, and there is a lot of work to be done to regain the old functionality.
You can now post sentences one at a time and they can be validated, but for now they will not be shown for recording. This recent PR should enable them for recording once it is merged.
There is also a bulk sentence posting method via a PR, but it does not work at the moment either; we are all waiting for these additions. Also, Common Voice is currently changing its backbone (AWS => Google Cloud), so bulk imports are disabled for now for this reason as well.
I’d suggest adding sentences either way, they will be available when the transition is complete.
And even if there are dated words, is it not possible to filter?
Well, adding some might be considered. In Turkish we have many words from the Ottoman era, with roots in Arabic and Persian, and I included some of them. Although younger people do not use them in daily life, many older people use them. But it needs careful selection and better language knowledge.
You can see they might cause problems in other languages, for example, read this.
it should not be so difficult to pull automatically sentences from pdfs in the public domain to at least not run out
I have a bunch of scripts doing this for Turkish resources, but these are language-specific, e.g. converting numbers to text, correcting misspellings, expanding common abbreviations, etc. Some examples can be found in the replace rules in my cv-sentence-extractor rules.
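To give a rough idea of what such language-specific replace rules look like, here is a minimal Python sketch. It is not taken from my actual scripts or from cv-sentence-extractor: the rule tables, the function names `clean_sentence` and `needs_review`, and the example words are all made up for illustration, and a real rule set would be much larger.

```python
import re

# Hypothetical, illustrative rules only; real rules depend on the language
# and the source text, and need to be curated by someone who knows both.
ABBREVIATIONS = {"Dr.": "Doktor", "Prof.": "Profesör"}
MISSPELLINGS = {"herkez": "herkes", "yanlız": "yalnız"}
DIGITS = ["sıfır", "bir", "iki", "üç", "dört", "beş",
          "altı", "yedi", "sekiz", "dokuz"]


def spell_digit(match):
    """Replace a single matched digit with its Turkish word."""
    return DIGITS[int(match.group())]


def clean_sentence(sentence):
    """Apply the replace rules in a fixed order and return the result."""
    # 1. Expand common abbreviations (plain string replacement).
    for abbr, full in ABBREVIATIONS.items():
        sentence = sentence.replace(abbr, full)
    # 2. Correct known misspellings, matching whole words only.
    for wrong, right in MISSPELLINGS.items():
        sentence = re.sub(rf"\b{re.escape(wrong)}\b", right, sentence)
    # 3. Spell out standalone single digits; longer numbers are left alone.
    sentence = re.sub(r"\b\d\b", spell_digit, sentence)
    return sentence


def needs_review(sentence):
    """Flag sentences that still contain digits after cleaning."""
    return bool(re.search(r"\d", sentence))


if __name__ == "__main__":
    for raw in ["Dr. Ahmet 3 elma aldı.", "1923 yılında herkez oradaydı."]:
        cleaned = clean_sentence(raw)
        print(cleaned, "| needs review:", needs_review(cleaned))
```

Anything the rules cannot handle (here, multi-digit numbers) is only flagged, not guessed at, so a human still decides what to do with it.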
After the extraction, I read each sentence (twice), possibly translating words from old language to common usage, and give them to the community to find my mistakes.
Without such human intervention, one can easily destroy the dataset quality.