Issues in the Romanian dataset

@elearningbakery

Hey Dragoş, welcome to the community. I’m a veteran contributor/volunteer, but answering some of your questions would require language knowledge and a more detailed dataset analysis (which takes time), so I’ll just give some starting points:

  • There are some very old discussions about the text corpus in this forum; search for “Romanian”.
  • MCV is a volunteer-based, crowd-sourced project, so people come and go. At a quick look I can see people contributed up to v7.0, after which the dataset was left alone. So there may not be anyone curating/caring for the dataset now.
  • You should actually look at the whole dataset: delta versions include only the last ~3-month period, so they are usually not representative of the whole dataset.
  • Q1: It is OK to have foreign-language speakers and people with accents. They should not be abundant though.
  • Q2: That’s bad, but check the reported.tsv file: people report such recordings so they can be reviewed and, if necessary, excluded from AI training.
  • Q3: You should really look at the validated.tsv file, which contains the recordings validated by the community.
  • Q4: That’s bad again.
  • Q5: It seems those come from an initial seed set: CC-0 Balkan news and maybe the Bible? The text corpus is small; one needs to continuously add new, conversational sentences.
  • NO: You cannot edit the sentences; they (hashes of them) serve as the index. But with a database migration one can disable them so they are no longer shown for new recordings.
  • Some of the problems you mention can be fixed with post-processing (e.g. spelling-correction scripts), though of course it’s better not to have them in the first place.
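As a starting point for the reported.tsv / validated.tsv suggestion above, here is a minimal sketch of filtering reported sentences out of the validated set before training. It uses inline sample data instead of the real TSV files, and the column names (`path`, `sentence`, `reason`) are assumptions based on the usual Common Voice TSV layout; check them against the headers in your dataset version.

```python
import csv
import io

# Stand-ins for validated.tsv and reported.tsv (real files are tab-separated
# with a header row; load them with open(...) instead of io.StringIO).
validated_tsv = """path\tsentence
clip1.mp3\tBună ziua
clip2.mp3\tasdf qwerty
clip3.mp3\tCe mai faci
"""

reported_tsv = """sentence\treason
asdf qwerty\tgrammar-or-spelling
"""

# Collect the sentences the community flagged.
reported = {row["sentence"]
            for row in csv.DictReader(io.StringIO(reported_tsv), delimiter="\t")}

# Keep only validated clips whose sentence was never reported.
clean = [row for row in csv.DictReader(io.StringIO(validated_tsv), delimiter="\t")
         if row["sentence"] not in reported]

print(len(clean))  # 2 clips survive the filter
```

Note that reports are per-sentence, not per-clip, so this drops every recording of a flagged sentence; for a real pipeline you would review the `reason` column first rather than excluding blindly.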

Here are two web apps you might want to check for further analysis of the dataset.
