Listening / Validating for Dutch language turns up a lot of Frisian (Frysk) sentences

SanderE · May 8, 2020, 6:03pm

While listening / validating for the Dutch language, the system turns up a lot of Frisian (Frysk) sentences (2 or 3 out of 5), like:

Jonge bern binne taalgefoelich en leare maklik in twadde taal.
Dêr soe ik yn alle gefallen al aardich mei holpen wêze.

I have reported some, but they keep coming up, and with 2 or 3 out of 5 it’s no fun.
Is something mixed up?

nukeador · May 8, 2020, 6:26pm

3/5 seems too high.

@mkohler can we check where these sentences are coming from?

/cc @Fjoerfoks @stergro

mkohler · May 8, 2020, 8:14pm

Both of these sentences are in the fy-NL data source, common-voice/server/data/fy-NL/sentence-collector.txt at master in common-voice/common-voice on GitHub, and nowhere else. Are we mixing up languages again? That also doesn’t explain why they would show up so many times…

dabinat · May 8, 2020, 8:20pm

I have a theory this is due to the same person recording in multiple languages and the clips getting tagged with the wrong language ID for validation.

https://github.com/mozilla/voice-web/issues/2488

SanderE · May 8, 2020, 8:37pm

I think that GitHub issue is exactly what is happening here as well. My Frisian is not that good, but the audio seems correct and was spoken without hesitation.

isomorph70 · May 9, 2020, 4:36pm

I also found a non-English sentence in the English dataset when I did the IPFS triee index: IJsberen hebben het steeds moeilijker om te overleven

I’ve been thinking about running the dataset through a spell checker to see if there are other mix-ups.

Fjoerfoks · May 9, 2020, 5:14pm

That is a Dutch sentence. Wow, something is really messed up. If you need any help investigating or fixing this, please let me know. Thanks for flagging, @SanderE.

nukeador · May 11, 2020, 10:54am

Thanks for the investigation here.

What I feel is more important is to get a sense of the volume of this issue. A low volume (< 3-5%) of sentences with issues is acceptable, since we have a reporting feature on the site to identify them.

SanderE · May 11, 2020, 11:07am

If we don’t find and fix the root cause, it will probably come back soon, at least for Dutch.

Everyone who speaks Frisian also speaks Dutch, so it is not unlikely that they contribute to both languages and switch during a session.

dabinat · May 11, 2020, 3:55pm

@nukeador But reporting a sentence doesn’t actually do anything, right?

nukeador · May 11, 2020, 3:58pm

It does record the report, so we have the data.

What we don’t have yet is a process to act on that information, but keep reports coming, we’ll use them at some point!

willem · May 23, 2020, 12:20pm

I was going to make a thread reporting this issue, but then I found this thread already. Right now about 1 out of 5 sentences I am asked to validate seems to be in Frisian. I report them all, but this is actually quite annoying as there doesn’t seem to be a keyboard shortcut for “report as other language”, so this takes quite some time.

I’m also wondering what is happening to these Frisian recordings, because they seem good quality, so it would be a waste if they all get rejected as Dutch sentences, instead of being approved as Frisian sentences where they belong.

SanderE · May 23, 2020, 12:48pm

If it were possible to get a list of the Dutch recordings queued up for validation, it would probably be much easier to:
run that list of sentences through hunspell and/or pycld2
quickly recheck by hand the ones that are flagged as not Dutch
let people remove those recordings from the Dutch queue and, where possible, move them to the right queue (probably most will be Frisian).
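
A rough sketch of the screening step from the list above. It does not call hunspell or pycld2; as a stand-in it assumes a hand-picked set of marker words taken from the two Frisian example sentences in this thread, and flags queued sentences that contain at least two of them. The names `QueuedSentence`, `clipId`, `frisianMarkerCount` and `flagForManualCheck` are made up for illustration and are not part of voice-web.

```typescript
// Marker words taken from the two Frisian example sentences in this thread.
// Illustrative only; a real check would use hunspell or pycld2 instead.
const FRISIAN_MARKERS: ReadonlySet<string> = new Set([
  "binne", "bern", "leare", "twadde", "dêr", "soe", "yn", "wêze",
]);

// Hypothetical shape of one entry in the validation queue.
interface QueuedSentence {
  clipId: number; // assumed identifier of the queued recording
  text: string;   // sentence the recording was made for
}

// Split a sentence into lowercase words, dropping common punctuation.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[\s.,;:!?"'()]+/)
    .filter((word) => word.length > 0);
}

// Count how many words of the sentence are known Frisian markers.
function frisianMarkerCount(text: string): number {
  return tokenize(text).filter((word) => FRISIAN_MARKERS.has(word)).length;
}

// Return the queued items that look Frisian and deserve a manual recheck.
function flagForManualCheck(
  queue: QueuedSentence[],
  minMarkers = 2,
): QueuedSentence[] {
  return queue.filter((item) => frisianMarkerCount(item.text) >= minMarkers);
}
```

The threshold of two markers is arbitrary; anything flagged this way would still go through the manual recheck before being moved to another queue.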

nukeador · May 25, 2020, 9:50am

If Frisian sentences are not mixed here:

Then, this is definitely an issue. Can you share your username on the portal and which sentences you reported? Do you have Frisian as a language on your profile?

Thanks!

SanderE · May 25, 2020, 7:24pm

No, the sentence lists are good. It looks like the language gets mixed up somehow in the recording step (stale session data, as @dabinat suggested?).
As a validator I don’t have Frisian in my profile, only Dutch.
But the person (female voice) who recorded both sentences from the starting post perhaps does have both Dutch and Frisian.
I should have reported the two sentences from the starting post. My profile name there is also SanderE, although at the time it could have been “sander”, or the reports may even have been made anonymously; I’m not entirely sure.

I just took a look at the voice-web code, especially the MySQL DB schema, which I tried to piece together from all the schema changes.
It seems both the sentences and the clips tables have a foreign key to the locales table.
But there seems to be no requirement for the sentence locale id to be equal to the clip locale id.
So what does a SQL query turn up when you compare the locale id of each clip with the locale id of its original sentence and list the mismatches (for Dutch and other languages)?
Hmmm, I seem to have missed a migration from 2019-12 that forces the clip locale to match the sentence locale where they differ:

github.com

mozilla/voice-web/blob/29ae172fbc9b1657360f7572ff3d5ec49915c23c/server/src/lib/model/db/migrations/20191227183100-match-clip-sentence-locale.ts

export const up = async function (db: any): Promise<any> {
  // Note: Manual backfill to follow.
  return db.runSql(`
    UPDATE clips
    INNER JOIN sentences
      ON clips.original_sentence_id = sentences.id
    SET clips.locale_id = sentences.locale_id
    WHERE clips.locale_id <> sentences.locale_id
  `);
};

export const down = function (): Promise<any> {
  return null;
};

So it seems this was already a problem before, and there was at least a fixup for the data. That fixup could be incomplete, though: if the locale was wrong, people could have downvoted the clip instead of reporting it, so perhaps the votes will have to be reset to zero as well.

But is there any reason to have a separate locale id for clips? I can’t think of one, except as a query optimization so you don’t have to do a join when getting the clips.
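
The mismatch check asked about earlier in this post could look roughly like the sketch below. The query is a read-only counterpart of the migration's UPDATE and reuses its table and column names; `clips.id`, the id types, and the names `MISMATCH_QUERY` and `findLocaleMismatches` are assumptions of mine. The function performs the same comparison on rows that are already in memory, so the logic can be tried without a database.

```typescript
// Read-only counterpart of the migration's UPDATE: list clips whose locale
// differs from the locale of the sentence they were recorded for.
// `clips.id` is an assumed column; the other names come from the migration.
const MISMATCH_QUERY = `
  SELECT clips.id,
         clips.locale_id AS clip_locale_id,
         sentences.locale_id AS sentence_locale_id
  FROM clips
  INNER JOIN sentences
    ON clips.original_sentence_id = sentences.id
  WHERE clips.locale_id <> sentences.locale_id
`;

// Hypothetical row shapes; the id types are guesses.
interface ClipRow {
  id: number;
  original_sentence_id: string;
  locale_id: number;
}

interface SentenceRow {
  id: string;
  locale_id: number;
}

interface Mismatch {
  clipId: number;
  clipLocaleId: number;
  sentenceLocaleId: number;
}

// Same comparison as MISMATCH_QUERY, done on in-memory rows.
function findLocaleMismatches(
  clips: ClipRow[],
  sentences: SentenceRow[],
): Mismatch[] {
  const localeBySentence = new Map<string, number>();
  for (const sentence of sentences) {
    localeBySentence.set(sentence.id, sentence.locale_id);
  }

  const mismatches: Mismatch[] = [];
  for (const clip of clips) {
    const sentenceLocaleId = localeBySentence.get(clip.original_sentence_id);
    // Clips without a matching sentence are skipped, as an INNER JOIN would.
    if (sentenceLocaleId !== undefined && sentenceLocaleId !== clip.locale_id) {
      mismatches.push({
        clipId: clip.id,
        clipLocaleId: clip.locale_id,
        sentenceLocaleId,
      });
    }
  }
  return mismatches;
}
```

If the 2019-12 migration caught everything and nothing has drifted since, this should return no rows; any rows it does return would be clips tagged after the fixup.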

nukeador · May 26, 2020, 12:17pm

I’ll ping the team to see if they can investigate with all these data, thanks!

SanderE · May 26, 2020, 1:41pm

As the locale id is sent back from the browser, one possible cause is browser sessions, which in general can get mixed up when you have multiple tabs or windows open at different stages of a web app.
I haven’t checked, but that could also be the case for the voice-web web app.

Most apps only let you log in once per browser (Gmail, for example). Managing sessions per window, so you could log in multiple times to the same web app, was the only thing IE 6.0 did right.
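
To make the multi-tab scenario from this post concrete, here is a toy model. It is purely hypothetical: nobody in this thread has checked that voice-web stores the locale this way, and `Session`, `Tab`, `Upload`, `openLanguage` and `record` are invented names. It only shows how a locale kept in state shared between tabs can end up attached to a recording made on a page for another language.

```typescript
// Hypothetical model: one session object shared by every tab of a browser.
interface Session {
  locale: string;
}

// A tab remembers which language page it shows, but shares the session.
interface Tab {
  session: Session;
  pageLocale: string;
}

interface Upload {
  sentence: string;
  pageLocale: string;   // language of the page the sentence was shown on
  taggedLocale: string; // language the upload ends up tagged with
}

// Opening a language page overwrites the locale in the shared session.
function openLanguage(session: Session, locale: string): Tab {
  session.locale = locale;
  return { session, pageLocale: locale };
}

// The upload is tagged with whatever the shared session holds right now,
// not with the language of the page the recording was made on.
function record(tab: Tab, sentence: string): Upload {
  return {
    sentence,
    pageLocale: tab.pageLocale,
    taggedLocale: tab.session.locale,
  };
}
```

In this model, opening a Frisian tab, then a Dutch tab, and then recording in the Frisian tab yields an upload whose `pageLocale` is `fy-NL` while its `taggedLocale` is `nl`, which is the kind of mix-up validators are seeing.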

johanVD · May 31, 2020, 12:09pm

I seem to have the same issue: around 1 in 50 sentences to validate is in Frisian (Fries).
I only have Dutch defined as a language in my profile.
(username = Johan - see top 10 in Dutch language)

SanderE · June 4, 2020, 12:58am

@nukeador Any news from the devs on this?
(I’m asking in light of “✅ June Validation Campaign: Enhance the upcoming dataset release!”, as I’m holding off on resuming validation until at least the DB queue is cleaned up. Having the root cause sorted out would be nice, but is somewhat less urgent.)

johanVD · June 7, 2020, 10:02am

As of today, I also receive English sentences (recorded by a male native Dutch speaker) in the Dutch “to validate” list.
