Issues in the Romanian dataset

bozden · February 28, 2025, 9:34pm

Now, Q3 - in parts…

You can look at the data ONLY in the releases, or see it during validation. When you look at the metadata in the release:

You will see *_sentences.tsv files for text-corpus only
Each of the other metadata files also has a sentence column valid for that data.
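To pull the sentences out of one of these files, something like the following should be enough. It is only a sketch: it assumes the file is tab-separated with a header row and that the text lives in a column named `sentence` (check the header of the file you are actually reading); `read_sentences` is just a helper name made up for this example.

```python
import csv
import io

def read_sentences(tsv_file, column="sentence"):
    """Yield the non-empty values of one column from a tab-separated file object."""
    # QUOTE_NONE: treat quote characters inside sentences as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get(column) or "").strip()
        if text:
            yield text

# Tiny in-memory stand-in for a metadata file; real files have more columns.
sample = io.StringIO("sentence_id\tsentence\n1\tBună ziua.\n2\tCe mai faci?\n")
sentences = list(read_sentences(sample))
```

For a real file, pass `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.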

The first part is already there (or not). Please see the explanations, and a problem, in this issue.

nouns, verbs, adjectives, and adverbs
compare it with a list from a bigger Romanian corpus

That would of course require knowledge of the language, and maybe some NLP tooling for that particular language. I don’t have either, so I do generic tokenization and word counting for the frequencies in the Analyzer's text-corpus tab.
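To give an idea of what "generic tokenization and word counting" means, here is a minimal, language-agnostic sketch. It is not the Analyzer's actual code: the regular expression, the lowercasing, and the function names are assumptions made for illustration. It does no part-of-speech tagging or lemmatization, so it cannot separate nouns, verbs, adjectives, and adverbs.

```python
import re
import unicodedata
from collections import Counter

# [^\W\d_] matches any Unicode letter; hyphens and apostrophes are kept
# only when they sit between letters (e.g. "într-o").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")

def tokenize(text):
    """Lowercase, NFC-normalize, and split text into word-like tokens."""
    text = unicodedata.normalize("NFC", text).lower()
    return TOKEN_RE.findall(text)

def word_frequencies(sentences):
    """Count how often each token occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts

freqs = word_frequencies(["Am o carte.", "Am văzut o carte într-o zi."])
top = freqs.most_common(3)  # the three most frequent tokens with their counts
```

Comparing against a bigger Romanian corpus would then mean running the same counting on that corpus and looking at the differences, ideally after a language-specific tagger has labeled the words.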

You should do these yourself, since you know the language, but anyway, I uploaded the intermediate files for you (v20.0 ro)… If our results do not agree, please break the glass (open an issue).

Nope, AI-generated sentences are not allowed. Here is some info:

I think its time to talk about AI generated sentences again (read answers/resolution)

It is degenerative, and we will see it in LLMs in a few years (actually, the dementia is already there).
