Hi, folks. I’ve just learnt about the Common Voice project and, as a Romanian speaker and translator, was curious to see what the datasets looked like. I downloaded the Common Voice Delta Segment 20 and, apart from there being three voices instead of two, I was wondering whether the following observations are considered issues or are fine with the project community:
1. One of the speakers (male) does not really articulate.
2. The sentences themselves are occasionally ungrammatical, cut off mid-sentence, or just bizarre.
3. It is not clear which of the attached .tsv files should be used with the recordings.
4. The validated_sentences.tsv file contains sentences written without diacritics (and sometimes without punctuation), or with a salad of diacritics, or nonsensical fragments which are misspelled and incomplete.
5. The subject matter jumps from the very religious to something like news or subtitles.
I guess my questions revolve around this: is there a community manager for Romanian who could at least fix the mistakes in the written transcripts, or do you intend to use these datasets as they are when training your models?
Many thanks and looking forward to hearing from you,
Dragoș
Hey Dragoș, welcome to the community. I’m a veteran contributor/volunteer, but answering some of your questions would require language knowledge and more detailed dataset analysis (which takes time), so I’ll only give some starting points:
There are some very old text-corpus-related discussions in this forum; search for “Romanian”.
MCV is a volunteer-based, crowd-sourced project, so people come and go. At a quick look I can see that people contributed towards v7.0, and then the dataset was left alone. So there might not be anyone curating/caring for the dataset now.
You should actually look at the whole dataset; delta versions only include the last three-month period, so they are usually not representative of the whole dataset.
Q1: It is OK to have foreign-language speakers and people with accents. They should not be abundant, though.
Q2: That’s bad, but check the reported.tsv file; people report such sentences so they can be checked and excluded from AI training if necessary.
Q3: You should really look at the validated.tsv file; it includes the recordings validated by the community.
Q4: That’s bad again.
Q5: It seems those are from an initial set: CC-0 Balkan news and maybe the Bible? The text corpus is small; one needs to continuously add new and conversational sentences.
NO: You cannot edit the sentences; they (or rather their hashes) are the index. But with a database migration, one can disable them from being shown for new recordings.
Some of the problems you mention can be fixed with post-processing (e.g. spelling-correction scripts), but of course it is better not to have them in the first place.
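For illustration, here is the kind of tiny script I mean - a sketch only, which handles just the mechanical cedilla-vs-comma-below issue in Romanian text (ş/ţ written with cedillas instead of the correct ș/ț); real spelling correction would of course need much more:

```python
# Sketch: rewrite the "sentence" column of validated.tsv, replacing
# cedilla diacritics with the correct comma-below forms.
CEDILLA_TO_COMMA = str.maketrans("şţŞŢ", "șțȘȚ")

with open("validated.tsv", encoding="utf-8") as src, \
     open("validated.fixed.tsv", "w", encoding="utf-8") as dst:
    header = src.readline().rstrip("\n").split("\t")
    col = header.index("sentence")          # locate the transcript column
    dst.write("\t".join(header) + "\n")
    for line in src:
        fields = line.rstrip("\n").split("\t")
        fields[col] = fields[col].translate(CEDILLA_TO_COMMA)
        dst.write("\t".join(fields) + "\n")
```

Plain line splitting is used instead of a CSV parser on purpose: CV .tsv files are tab-separated with no quoting, so this avoids quote-handling surprises.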
Here are two webapps you might like to check for further analysis of the dataset.
Many thanks for these pointers, dear Bülent @bozden!
Interesting to see what folks have been discussing with regard to Romanian (and easy to look at, as it wasn’t much :)). I’m curious to hear how the experiments with Romanian automatic speech recognition trained on speakers from Romania have been working for speakers from Moldova (there are differences in pronunciation and vocabulary between the two).
I’ve downloaded the whole dataset and am looking at it now.
Q1: Got you. The full dataset has better, clearer speakers who can articulate, from what I have been listening to so far.
Q2: reported.tsv had 400+ sentences that are ungrammatical, misspelled, or in a language other than Romanian. Some could be rescued and re-added to the dataset, imo.
Q3: The validated.tsv for Romanian still has a few errors in the text, so I am wondering about the error tolerance of this project in general - is it “as good as it happens to be”, or is there a more specific target, like 5%, or less/more?
I started inserting missing diacritics into the validated.tsv sentences, which is how I also noticed that some sentences are repeated (maybe that is OK if they are read out by various people).
This, however, also made me wonder about data organisation: in RO, validated.tsv has over 18k sentences and validated_sentences.tsv close to 14k, while in EN delta 20 (I was just curious to compare and didn’t want to download the whole EN corpus) validated.tsv has 251 sentences and validated_sentences.tsv 1.6 million. So I am guessing validated_sentences.tsv should be the master file being fed by the validated deltas, but I’m not sure this was strictly observed in the Romanian corpus…
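For what it’s worth, this is roughly how I compared the two files - a quick sketch, assuming both files carry a sentence_id column (recent releases do; adjust the key column if yours differ):

```python
import csv
import pandas as pd

# CV .tsv files are unquoted, so disable quote handling.
validated = pd.read_csv("validated.tsv", sep="\t", quoting=csv.QUOTE_NONE)
sentences = pd.read_csv("validated_sentences.tsv", sep="\t", quoting=csv.QUOTE_NONE)

recorded = set(validated["sentence_id"])   # sentences with validated recordings
corpus = set(sentences["sentence_id"])     # sentences in the validated text corpus

print("recordings:", len(validated))
print("distinct recorded sentences:", len(recorded))
print("sentences in the text corpus:", len(corpus))
print("recorded but absent from validated_sentences.tsv:", len(recorded - corpus))
```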
Q5: I see. I played a bit with the RO corpus of validated sentences and saw that out of the 130k+ words it uses, the number of unique nouns, verbs, or adjectives is not high (3.5k, 962, and 1.7k respectively). Maybe having the same words repeated in many short sentences will help with building a model capable of recognising them, but I wonder about the rest of the vocabulary, which is not present in the data…
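In case it is useful, this is roughly how such counts can be produced - a sketch using spaCy, assuming its Romanian model is installed (pip install spacy, then python -m spacy download ro_core_news_sm):

```python
import csv
from collections import Counter
import spacy

# Tagging and lemmatization are enough here; skip the parser and NER for speed.
nlp = spacy.load("ro_core_news_sm", disable=["parser", "ner"])

lemmas_by_pos = {"NOUN": Counter(), "VERB": Counter(), "ADJ": Counter()}
with open("validated_sentences.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        for token in nlp(row["sentence"]):
            if token.pos_ in lemmas_by_pos:
                lemmas_by_pos[token.pos_][token.lemma_.lower()] += 1

for pos, counter in lemmas_by_pos.items():
    print(pos, "unique lemmas:", len(counter), "top:", counter.most_common(5))
```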
All in all, thank you again for your time and explanations. It’s been interesting looking at this dataset, for sure! I will keep an eye on the discussions here in case there is renewed interest in working on the RO dataset.
Above you can see that the dataset was worked on between v6.1 and v7.0 (6 months). Probably there was a recording campaign, so recorded clips and users increased a lot. On the other hand, very few of them have been LISTENed to (validated).
After that, people kept coming and recording, but again with very few validations.
The first order of business should be to build a team to validate all those recordings waiting (~27 hours) - low-hanging fruit, but it will need lots of effort.
TEXT CORPUS
Fixing sentences in validated.tsv BEFORE the data goes into AI training is a good way to post-process it and increase quality. BUT don’t do it by hand; write a program (make it open source, maybe?). Each time a new release comes out, the sentences are taken fresh from the database, so you would have to repeat the effort for v21.0 and onwards.
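For example, one repeatable approach (a sketch, not an existing tool) is to keep your hand-made fixes in a hypothetical corrections.tsv with sentence_id and sentence columns, and re-apply them to each new release’s validated.tsv:

```python
import csv
import pandas as pd

# corrections.tsv: sentence_id <TAB> corrected sentence (maintained by hand, once)
corrections = pd.read_csv("corrections.tsv", sep="\t", quoting=csv.QUOTE_NONE)
validated = pd.read_csv("validated.tsv", sep="\t", quoting=csv.QUOTE_NONE)

# Left-join the corrections onto the release, keeping originals where no fix exists.
merged = validated.merge(corrections, on="sentence_id", how="left",
                         suffixes=("", "_fixed"))
merged["sentence"] = merged["sentence_fixed"].fillna(merged["sentence"])
merged.drop(columns=["sentence_fixed"]).to_csv(
    "validated.fixed.tsv", sep="\t", index=False)
```

Because the corrections are keyed by sentence_id, the same file can be replayed against v21.0, v22.0 and so on with no extra manual work.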
How to fix them permanently?
You can’t (and maybe shouldn’t?). If a bad sentence has been recorded, it is not easy to remove it, because it is tied to existing recordings. The team is reluctant to do so unless absolutely necessary (e.g. license/legal problems).
But you can MARK them not to be shown for new contributions, using a database migration, just by setting the is_used flag to FALSE for those sentences. A couple of days ago I sent one such PR for ady; if you want to do that, it is here.
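To make the mechanism concrete, the migration boils down to an UPDATE like the one this sketch prints - the table and id-column names here are my assumptions and may differ in the actual schema; only the is_used flag is as described above. bad_ids.txt is a hypothetical file with one sentence id per line:

```python
# Sketch: build the UPDATE statement such a migration would run.
with open("bad_ids.txt", encoding="utf-8") as f:
    ids = [line.strip() for line in f if line.strip()]

id_list = ", ".join(f"'{sid}'" for sid in ids)
print(f"UPDATE sentences SET is_used = FALSE WHERE sentence_id IN ({id_list});")
```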
In the past, 5% error in the text corpus was the maximum, but I never liked it. My aim is always 0%, but I can only accomplish 1-2%. Even triple reading does not suffice; some slip through.
IMO the text corpus is the most important part of the pipeline.
The same sentence can be recorded up to 15 times nowadays, but this is again a very high value. It depends on the model architecture and the application, but I think a maximum of 5 recordings per sentence is better for general-purpose ASR. If you work with wake words, you would need the same word recorded a thousand times, for example…
Of course some words like “one” and “I” will appear much more often, but this text corpus is too small to see such distinctions. Validate the waiting ones and see what happens in a year or so…
I have all the intermediate data from which I get these statistics (for all CV languages), and I can share it if you like…
This is super amazing, @bozden Bülent, thank you so much! I have noted your advice about finding another way to fix the written corpus apart from doing it manually.
I’ve rolled up my sleeves and have started recording and validating, too - btw, I like the mechanism of reporting sentences without diacritics or with other kinds of errors and then moving on to the next sentence; it’s also great that recordings can be re-listened to and re-recorded if necessary before submission.
So now for four perhaps silly questions, but ones quite important for me before going ahead and trying to get my students and other contacts involved in validating and recording, please:
I notice that new versions of the CV corpus are published every 3 months or so for Romanian. Is this how it will remain, or does it depend on maintaining a certain amount of activity (e.g. recording and validating a certain minimum number of sentences)?
Apart from getting badges and a leaderboard place for contributing to the project, are there any Mozilla (or other) live ASR and TTS implementations using Romanian which contributors can revisit every few weeks to notice progress? I think celebrating contributions by showing the community how the tools are progressing thanks to their donations is a really great way to maintain motivation - haven’t any of the other language teams done this?
Would it be possible to take the sentences which have been validated, the ones which still need to be validated, and the ones which still need to be recorded; get a frequency list of content words (nouns, verbs, adjectives, and adverbs) from them; compare it with a list from a bigger Romanian corpus (maybe even roTenTen21); and then use LLMs to generate example sentences with the content words which are not yet sufficiently present in the CV data? Would LLM-generated sentences of up to 14 words each be acceptable for CV?
What is the procedure for joining the localization team? I am looking at Common Voice, which should be in RO, but only the heading is… I have been working with CAT tools for a while now and know my way around several, so I think I can help with the L10N process. Maybe some of my students can handle other languages in the community, too. UPDATE: scratch that, I am now in Pontoon and having a look around.
Thank you again for your time and super kind help,
Dragoș
You are asking wonderful questions for a new contributor, @elearningbakery.
Q1: A new FULL version for ro will be published regardless of VOICE contributions. If there are no contributions, it will be the same as the previous one (except for some metadata, e.g. new sentences, new reported sentences, new validations, etc.). A DELTA version is only released if a NEW VOICE contribution was made in that 3-month period. I have an open-source script to merge a previous FULL version with a new DELTA, which saves bandwidth and disk space - especially good for large datasets.
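The idea of the merge, very roughly (a simplified sketch, not my actual script - among other things, a real merge also has to handle clips whose validation status changed between releases; the directory names are just examples):

```python
import csv
import shutil
from pathlib import Path
import pandas as pd

full, delta = Path("cv-corpus-19.0-ro"), Path("cv-corpus-20.0-delta-ro")
merged = Path("cv-corpus-20.0-ro-merged")
(merged / "clips").mkdir(parents=True, exist_ok=True)

# Union the clip metadata, letting delta rows win on duplicate clip paths.
for tsv in ("validated.tsv", "invalidated.tsv", "other.tsv"):
    frames = [pd.read_csv(p / tsv, sep="\t", quoting=csv.QUOTE_NONE)
              for p in (full, delta) if (p / tsv).exists()]
    pd.concat(frames).drop_duplicates(subset="path", keep="last").to_csv(
        merged / tsv, sep="\t", index=False)

# Copy audio from both releases into the merged clips directory.
for src_dir in (full / "clips", delta / "clips"):
    for mp3 in src_dir.glob("*.mp3"):
        shutil.copy2(mp3, merged / "clips" / mp3.name)
```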
Q2: MCV is for dataset creation, not for AI models or the applications using them. You should look around (e.g. HuggingFace, OpenAI, NVIDIA, Kaggle, etc.) to see if somebody has worked on Romanian. Here, we have (or try to have) language communities, who can run local/global campaigns, do some fund-raising, give tokens of appreciation, etc. E.g. currently I’m helping the Circassian languages (ady & kbd - minority languages in Turkey), trying to build communities, give training, and design and implement campaigns (search Google for “#CommonVoice” “#Circassian”). There is a COMMUNITIES.md document where people post if they have such meeting points, and here and here are some of my views on communities.
In the past, I personally created a 3D Voice-Chess application to show how nicely contributions can help AI, back when I was managing a campaign for the Turkish dataset (it is old and non-functional now).
Q3: Multi-part question, I’ll answer that below…
Q4: Yep, Pontoon. Make sure you also join the Matrix channel for Common Voice to get quick support/answers. Here are the Managers/Translators for Romanian. If your suggestions do not pass, try to contact them; otherwise write to the Matrix channel and the team will help you.
You can look at the data ONLY in the releases, or see it during validation. When you look at the metadata in a release:
- You will see *_sentences.tsv files for the text corpus only.
- Every other metadata file also has a sentence column valid for that data.
First part is already there (or not). Please see the explanations and a problem in this issue.
“nouns, verbs, adjectives, and adverbs”
“compare it with a list from a bigger Romanian corpus”
That would require language knowledge, of course, and maybe some NLP tooling for that particular language. I don’t have those, so I do generic tokenization and word counting for the frequencies in the Analyzer’s text-corpus tab.
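Roughly like this (a sketch of the generic approach, not the Analyzer’s actual code): split on non-letter characters, lowercase, count.

```python
import csv
import re
from collections import Counter

freq = Counter()
with open("validated_sentences.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        # [^\W\d_]+ matches runs of Unicode letters only (no digits/underscores).
        freq.update(re.findall(r"[^\W\d_]+", row["sentence"].lower()))

print("unique tokens:", len(freq))
for word, count in freq.most_common(20):
    print(f"{count:6d}  {word}")
```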
You should do these knowing the language, but anyway, I uploaded the intermediate files for you (v20.0 ro)… If our results do not match, please break the glass (open an issue).
Nope, AI-generated sentences are not allowed. Here is some info: