I can't speak sentences in Portuguese. There are no phrases for the language

bozden · August 31, 2023, 8:26pm

Oh, I see. You are right. Recently the old Sentence Collector application was partially moved into the main Common Voice application, and that caused a series of problems. There are quite a few issues posted on GitHub, and a lot of work remains to regain the old functionality.

You can now post sentences one at a time, and they can be validated. They will not be shown for recording yet, but this recent PR should enable them for recording once it is merged.

There is also a bulk sentence posting method via a PR, but it does not work right now either; we are all waiting for these additions. Common Voice is also currently changing its backbone (AWS => Google Cloud), so bulk imports are disabled for that reason as well.

I’d suggest adding sentences either way; they will become available when the transition is complete.

And even if there are dated words, is it not possible to filter?

Well, adding some might be considered. In Turkish we have many words from the Ottoman era, with roots in Arabic and Persian, and I included some of them. Although younger people do not use them in daily life, many older people use them. But it needs careful selection and better language knowledge.

Dated words can cause problems in other languages too; for example, read this.
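As a rough illustration of the filtering idea asked about above, here is a minimal Python sketch. The word list and the function name are hypothetical, and it uses English examples; it only flags sentences for a human to review, it does not decide anything on its own. Note that plain `lower()` is not correct for the Turkish dotted and dotless I, so a real Turkish version would need its own case folding.

```python
import re

# Hypothetical list of dated words; a real list needs a native speaker's judgment.
DATED_WORDS = {"thou", "thee", "hath"}


def split_by_dated_words(sentences, dated_words=DATED_WORDS):
    """Return (kept, flagged); flagged sentences contain at least one dated word."""
    kept, flagged = [], []
    for sentence in sentences:
        # \w+ matches Unicode letters in Python 3, so non-ASCII text is tokenized too.
        tokens = {t.lower() for t in re.findall(r"\w+", sentence)}
        if tokens & dated_words:
            flagged.append(sentence)
        else:
            kept.append(sentence)
    return kept, flagged
```

The flagged list is where the careful selection happens: some of those sentences may still be worth keeping.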

It should not be so difficult to automatically pull sentences from PDFs in the public domain, to at least not run out.

I have a bunch of scripts doing this for Turkish resources, but they are language-specific, e.g. converting numbers to text, correcting misspellings, expanding common abbreviations, etc. Some examples can be found among the replace rules in my cv-sentence-extractor rules.
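To give an idea of what such cleanup steps look like, here is a minimal Python sketch. It is not taken from my scripts or from cv-sentence-extractor: the rule tables, the function names, and the English examples are all made up for illustration, and it only spells out whole numbers from 0 to 99 (decimals and longer numbers are not handled).

```python
import re

# All rules below are hypothetical English examples; real rules are language-specific.
ABBREVIATIONS = {"approx.": "approximately", "e.g.": "for example"}
MISSPELLINGS = {"recieve": "receive", "seperate": "separate"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def clean_sentence(sentence):
    """Apply abbreviation, misspelling, and number rules to one sentence."""
    # Expand abbreviations first, while their trailing dots are still intact.
    for abbr, full in ABBREVIATIONS.items():
        sentence = sentence.replace(abbr, full)
    # Fix known misspellings on whole words only.
    for wrong, right in MISSPELLINGS.items():
        sentence = re.sub(rf"\b{re.escape(wrong)}\b", right, sentence)
    # Spell out standalone one- or two-digit numbers; longer ones are left untouched.
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), sentence)
```

For example, `clean_sentence("I recieve approx. 42 letters.")` gives `"I receive approximately forty-two letters."`. Every such rule can also go wrong on some sentence, which is why the output still needs to be read by a person.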

After the extraction, I read each sentence (twice), possibly translating words from the old language into common usage, and then give them to the community to find my mistakes.

Without such human intervention, one can easily destroy the dataset quality.
