Concerns about low quality Japanese voice data / 日本語の低品質データについて

norimaki · November 25, 2024, 11:28am

I recently joined and have just finished about 1000 validations of Japanese voice clips. My gut feeling is that 30-40% of the clips are of poor quality due to incorrect pronunciation, high crackling noise, or intentional spamming (even just singing and laughing!).

Another problem is that many sentences are taken from copyrighted material (like anime), even though they are “validated”.

I think a certain number of people do this as part of class assignments, judging by the background noises. So one way to mitigate the low quality data would be to properly inform students beforehand. I hope some teachers read this post and take action. While it is great to encourage students to contribute to projects like Common Voice, if it is not done properly it can actually hurt the effort.

Finally, more validators are needed since the Japanese validation progress is currently only 36%.

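(The summary below was originally posted in Japanese.)

If any educators who use Common Voice as a school assignment are reading this, I would be grateful if you could make your students aware of the points above.

There is a large amount of unvalidated voice data, so I think people who are about to join should focus on validation rather than (only) recording.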

irvin · November 26, 2024, 1:16pm

Hi norimaki,

(I’m Irvin, a volunteer from Taiwan)

We rely on volunteers to promote Common Voice locally, find sentences that are suitable to record, and verify the recordings.

Please:

down-vote the low-quality recordings,
find good public domain sentences and add them to the platform.

The most important thing is to ask others to join, tell them why this is important, and work together to improve the database.

You may find other volunteers in the Japanese Mozilla Community Slack. If you need the join link, please DM me or reply here and I can add you.
