Oh, I see. You are right. The old Sentence Collector application was recently partially moved into the main Common Voice application, and that caused a series of problems. There are quite a few issues posted on GitHub, and a lot of work remains to regain the old functionality.
You can now post sentences one at a time, and they can be validated. They will not be shown for recording for now, but this recent PR should enable them for recording once it is merged.
There is also a bulk sentence posting method via a PR, but it does not work as of now either; we are all waiting for these additions. Also, Common Voice is currently changing its backbone (AWS => Google Cloud), so bulk imports are disabled for now for that reason as well.
I’d suggest adding sentences either way; they will become available when the transition is complete.
And even if there are dated words, is it not possible to filter?
Well, adding some might be considered. In Turkish we have many words from the Ottoman era, with roots in Arabic and Persian, and I included some of them. Although younger people do not use them in daily life, many older people use them. But it needs careful selection and better language knowledge.
Such words can also cause problems in other languages; for an example, read this.
it should not be so difficult to pull automatically sentences from pdfs in the public domain to at least not run out
I have a bunch of scripts doing this for Turkish resources, but these are language-specific, e.g. converting numbers to text, correcting misspellings, expanding common abbreviations, etc. Some examples can be found in the replace rules within my cv-sentence-extractor rules.
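To give a rough idea of what such a cleaning step can look like, here is a minimal Python sketch. It is not taken from the scripts mentioned above; the rule tables, the function names, and the English placeholder entries are all hypothetical and only illustrate the three kinds of rules listed (abbreviations, misspellings, numbers). Real rules would have to be written per language.

```python
import re

# Hypothetical, illustrative rule tables (English placeholders).
# Real rules are language-specific and much longer.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}
MISSPELLINGS = {"recieve": "receive"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def expand_abbreviations(text):
    """Replace each known abbreviation with its full form."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text


def fix_misspellings(text):
    """Replace known misspellings, matching whole words only."""
    for wrong, right in MISSPELLINGS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text


def spell_single_digits(text):
    """Spell out standalone single digits; leave longer numbers,
    decimals and similar cases untouched for a human to handle."""
    pattern = r"(?<![\d.,])\d(?![.,]?\d)"
    return re.sub(pattern, lambda m: DIGITS[int(m.group())], text)


def clean_sentence(text):
    """Apply all rule groups in a fixed order."""
    text = expand_abbreviations(text)
    text = fix_misspellings(text)
    text = spell_single_digits(text)
    return text


def needs_review(text):
    """Flag sentences that still contain digits after cleaning."""
    return bool(re.search(r"\d", text))


if __name__ == "__main__":
    for raw in ["Dr. Smith will recieve 3 letters.", "He was born in 1987."]:
        cleaned = clean_sentence(raw)
        print(cleaned, "| needs review:", needs_review(cleaned))
```

The sketch deliberately converts only the trivial cases and flags everything else, since anything the rules cannot handle safely is better left to a human reader.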
After the extraction, I read each sentence (twice), possibly translating words from old language to common usage, and give them to the community to find my mistakes.
Without such human intervention, one can easily destroy the dataset quality.