Issues in the Romanian dataset

@elearningbakery

Hey Dragoş, welcome to the community. I’m a veteran contributor/volunteer, but answering some of your questions would require language knowledge and a more detailed dataset analysis (which takes time), so I’ll just give some starting points:

  • There are some very old discussions about the text corpus in this forum; search for “Romanian”.
  • MCV is a volunteer-based, crowd-sourced project, so people come and go. At a quick look I can see people contributed up to v7.0, after which the dataset was left alone. So there may not be anyone curating/caring for the dataset now.
  • You should actually look at the whole dataset: delta versions include only the last ~3-month period, so they are usually not representative of the whole dataset.
  • Q1: It is OK to have foreign-language speakers and people with accents. They should not be abundant though.
  • Q2: That’s bad, but check the reported.tsv file: people report such recordings so they can be reviewed and, if necessary, excluded from AI training.
  • Q3: You should really look at the validated.tsv file, which contains the recordings validated by the community.
  • Q4: That’s bad again.
  • Q5: It seems those come from an initial seed set: CC-0 Balkan news and maybe the Bible? The text corpus is small; one needs to continuously add new, conversational sentences.
  • NO: You cannot edit the sentences; they (hashes of them) serve as the index. But with a database migration one can disable them so they are no longer shown for new recordings.
  • Some of the problems you mention can be fixed with post-processing (e.g. spelling-correction scripts), though of course it’s better not to have them in the first place.
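As a starting point for the reported.tsv / validated.tsv suggestion above, here is a minimal sketch of filtering reported sentences out of the validated set before training. It uses inline sample data instead of the real TSV files, and the column names (`path`, `sentence`, `reason`) are assumptions based on the usual Common Voice TSV layout; check them against the headers in your dataset version.

```python
import csv
import io

# Stand-ins for validated.tsv and reported.tsv (real files are tab-separated
# with a header row; load them with open(...) instead of io.StringIO).
validated_tsv = """path\tsentence
clip1.mp3\tBună ziua
clip2.mp3\tasdf qwerty
clip3.mp3\tCe mai faci
"""

reported_tsv = """sentence\treason
asdf qwerty\tgrammar-or-spelling
"""

# Collect the sentences the community flagged.
reported = {row["sentence"]
            for row in csv.DictReader(io.StringIO(reported_tsv), delimiter="\t")}

# Keep only validated clips whose sentence was never reported.
clean = [row for row in csv.DictReader(io.StringIO(validated_tsv), delimiter="\t")
         if row["sentence"] not in reported]

print(len(clean))  # 2 clips survive the filter
```

Note that reports are per-sentence, not per-clip, so this drops every recording of a flagged sentence; for a real pipeline you would review the `reason` column first rather than excluding blindly.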

Here are two web apps you might want to check for further analysis of the dataset.
