Listening / Validating for Dutch language turns up a lot of Frisian (Frysk) sentences

While listening / validating for the Dutch language, the system turns up a lot of Frisian (Frysk) sentences (2 or 3 out of 5), like:

  • Jonge bern binne taalgefoelich en leare maklik in twadde taal.
  • Dêr soe ik yn alle gefallen al aardich mei holpen wêze.

I have reported some, but they keep coming up, and at 2 or 3 out of 5 it’s no fun.
Is something mixed up?

3/5 seems too high.

@mkohler can we check where these sentences are coming from?

/cc @Fjoerfoks @stergro

Both these sentences are in the fy-NL data sources: https://github.com/mozilla/voice-web/blob/master/server/data/fy-NL/sentence-collector.txt#L7066 and https://github.com/mozilla/voice-web/blob/master/server/data/fy-NL/sentence-collector.txt#L3259 and nowhere else. Are we mixing up languages again? Also doesn’t explain why it would show up so many times either…

I have a theory this is due to the same person recording in multiple languages and the clips getting tagged with the wrong language ID for validation.

I think that GitHub issue is exactly what is happening here as well. My Frisian is not that good, but the audio seems correct and was spoken without hesitation.

I also found a non-English sentence in the English dataset when I did the IPFS triee index: IJsberen hebben het steeds moeilijker om te overleven

I’ve been thinking about running the dataset through a spell checker to see if there are other mix-ups.

That is a Dutch sentence. Wow, something is really messed up. If you need any help investigating or fixing this, please let me know. Thanks for flagging, @SanderE

Thanks for the investigation here.

What I feel is more important is getting a sense of the volume of this issue. A low volume (< 3-5%) of sentences with issues is acceptable, since we have a reporting feature on the site to identify them.

If we don’t know and fix the root cause, it will probably come back soon, at least for Dutch.

For Dutch, all people who speak Frisian also speak Dutch, so it isn’t unlikely that they contribute to both languages and switch during a session.

@nukeador But reporting a sentence doesn’t actually do anything, right?

It does record the report, so we have the data.

What we don’t have yet is a process to act on that information, but keep reports coming, we’ll use them at some point!

I was going to make a thread reporting this issue, but then I found this thread already. Right now about 1 out of 5 sentences I am asked to validate seems to be in Frisian. I report them all, but this is actually quite annoying as there doesn’t seem to be a keyboard shortcut for “report as other language”, so this takes quite some time.

I’m also wondering what is happening to these Frisian recordings, because they seem good quality, so it would be a waste if they all get rejected as Dutch sentences, instead of being approved as Frisian sentences where they belong.

If it were possible to get a list of the Dutch recordings queued up for validation, it would probably be much easier to:
  • run that list of sentences through hunspell and/or pycld2
  • quickly recheck by hand the ones that are reported as not Dutch
  • let people remove those recordings from the Dutch queue and, where possible, move them to the right queue (probably most will be Frisian).
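To make the first step a bit more concrete, here is a minimal sketch of the kind of pre-filter I have in mind. It does not use hunspell or pycld2; as a stand-in it just counts a few hand-picked Frisian and Dutch function words, so the word lists and the function names (`guess_language`, `needs_manual_check`) are my own assumptions for illustration and a real run would use a proper language detector.

```python
import re
import unicodedata

# Tiny hand-picked marker sets: an assumption for illustration, far from complete.
FRISIAN_MARKERS = {"yn", "binne", "fan", "wêze", "soe", "dêr", "bern", "hawwe", "foar"}
DUTCH_MARKERS = {"het", "een", "van", "zijn", "hebben", "niet", "voor", "worden", "deze"}


def guess_language(sentence):
    """Return "fy", "nl", or "unknown" based on marker-word hits."""
    # Normalize so that accented letters such as "ê" compare equal to the markers.
    text = unicodedata.normalize("NFC", sentence).lower()
    words = set(re.findall(r"\w+", text))
    fy_hits = len(words & FRISIAN_MARKERS)
    nl_hits = len(words & DUTCH_MARKERS)
    if fy_hits > nl_hits:
        return "fy"
    if nl_hits > fy_hits:
        return "nl"
    return "unknown"


def needs_manual_check(sentences, expected="nl"):
    """Return the sentences whose guessed language differs from the expected one."""
    return [s for s in sentences if guess_language(s) != expected]


# Sentences quoted earlier in this thread, as if they were all in the Dutch queue.
queue = [
    "Jonge bern binne taalgefoelich en leare maklik in twadde taal.",
    "Dêr soe ik yn alle gefallen al aardich mei holpen wêze.",
    "IJsberen hebben het steeds moeilijker om te overleven",
]
for sentence in needs_manual_check(queue):
    print(guess_language(sentence), sentence)
```

Anything this flags (including the "unknown" cases) would still go to a human for the manual recheck; it is only meant to shrink the list.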

If Frisian sentences are not mixed here:

Then this is definitely an issue. Can you share your username on the portal and which sentences you reported? Do you have Frisian as a language on your profile?

Thanks!

No, the sentence lists are good. It looks like the language somehow gets mixed up at the recording step (stale session data, as @dabinat suggested?).
As validator I don’t have Frisian in my profile, only Dutch.
But the person (female voice) who recorded both sentences from the starting post perhaps does have both Dutch and Frisian.
These two sentences from the starting post should have been reported by me (my profile name there is also SanderE, although at the time it could have been “sander” or even anonymous; I’m not entirely sure).

I just took a look at the voice-web code and especially the MySQL DB schema, which I tried to puzzle together from all the schema changes.
It seems both the sentence and the clips table have a foreign key to the locales table.
But there seems to be no requirement for the sentence locale id to be equal to the clip locale id.
So what does a SQL query turn up when you compare the locale id of each clip with the locale id of the original sentence and list the mismatches (for Dutch and other languages)?
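Roughly the kind of query I mean, as a sketch only. I am running it against an in-memory SQLite database instead of the real MySQL one, and the table and column names (`clips.original_sentence_id`, `clips.locale_id`, `sentences.locale_id`, `locales.name`) as well as the sample rows are my assumptions, not copied from the actual schema.

```python
import sqlite3

# In-memory stand-in for the real database; schema and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locales (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sentences (
        id TEXT PRIMARY KEY,
        text TEXT NOT NULL,
        locale_id INTEGER NOT NULL REFERENCES locales(id)
    );
    CREATE TABLE clips (
        id INTEGER PRIMARY KEY,
        original_sentence_id TEXT NOT NULL REFERENCES sentences(id),
        locale_id INTEGER NOT NULL REFERENCES locales(id)
    );
    INSERT INTO locales VALUES (1, 'nl'), (2, 'fy-NL');
    INSERT INTO sentences VALUES
        ('s1', 'IJsberen hebben het steeds moeilijker om te overleven', 1),
        ('s2', 'Dêr soe ik yn alle gefallen al aardich mei holpen wêze.', 2);
    -- Clip 12 is a recording of the Frisian sentence tagged as Dutch.
    INSERT INTO clips VALUES (10, 's1', 1), (11, 's2', 2), (12, 's2', 1);
""")

# List every clip whose locale differs from the locale of its sentence.
MISMATCH_QUERY = """
    SELECT c.id, cl.name AS clip_locale, sl.name AS sentence_locale
    FROM clips c
    JOIN sentences s ON s.id = c.original_sentence_id
    JOIN locales cl ON cl.id = c.locale_id
    JOIN locales sl ON sl.id = s.locale_id
    WHERE c.locale_id <> s.locale_id
    ORDER BY c.id
"""
mismatches = conn.execute(MISMATCH_QUERY).fetchall()
for clip_id, clip_locale, sentence_locale in mismatches:
    print(clip_id, clip_locale, sentence_locale)
```

Grouping the result by `clip_locale` would also give a rough idea of the volume per language.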
Hmmm I seem to have missed a migration from 2019-12 which seems to have forced them to have the same locale if they differ:

So it seems this has at least been a problem before, with at least a fixup for the data (although that could be incomplete: if the locale was wrong, people could have downvoted the clip instead of reporting it, so perhaps the votes will have to be reset to zero as well).

But is there any reason to have a separate locale id for clips (I can’t think of it, except as a query optimization so you don’t have to do a join when getting the clips) ?

I’ll ping the team to see if they can investigate with all these data, thanks!

As the locale id is sent back from the browser, perhaps one thing that could cause it is browser sessions, which in general can get mixed up when you have multiple browser tabs/windows open at different stages of a web app.
Haven’t checked but that could also be the case for the voice-web webapp.
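To illustrate what I mean by sessions mixing up, here is a toy simulation. It is purely hypothetical and not taken from the voice-web code: `shared_storage` stands in for any browser state shared between tabs, and the `Tab` class and its methods are names I made up.

```python
from dataclasses import dataclass

# Stands in for browser state shared by all tabs of the same site (assumption).
shared_storage = {"locale": "nl"}


@dataclass
class Tab:
    sentence: str = ""
    sentence_locale: str = ""

    def load_sentence(self, locale, sentence):
        # Switching language in one tab overwrites the shared locale for all tabs.
        shared_storage["locale"] = locale
        self.sentence = sentence
        self.sentence_locale = locale

    def upload_clip(self):
        # The hypothesized bug: the locale is read from shared state at upload
        # time instead of from the sentence this tab is actually showing.
        return {"sentence": self.sentence, "locale": shared_storage["locale"]}


frisian_tab = Tab()
frisian_tab.load_sentence("fy-NL", "Dêr soe ik yn alle gefallen al aardich mei holpen wêze.")

# The same user opens a second tab and switches to Dutch.
dutch_tab = Tab()
dutch_tab.load_sentence("nl", "IJsberen hebben het steeds moeilijker om te overleven")

# Back in the first tab, the Frisian recording gets uploaded with the Dutch locale.
clip = frisian_tab.upload_clip()
print("uploaded as", clip["locale"], "but sentence is", frisian_tab.sentence_locale)
```

If something like this is going on, taking the clip's locale from the sentence on the server side, rather than from what the browser sends, would avoid the mismatch.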

Most apps force you to only log in once (Gmail, for example). It was the only thing IE 6.0 did right: it managed sessions per window, so you could log in multiple times in web apps.