Concerns about low-quality Japanese voice data

I recently joined and have just finished about 1,000 validations of Japanese voice clips. My gut feeling is that 30-40% of the clips are of poor quality because of incorrect pronunciation, heavy crackling noise, or intentional spamming (even just singing and laughing!).

Another problem is that many sentences are taken from copyrighted material (like anime), even though they are marked as “validated”.

I think a certain number of people do this as part of class assignments, judging by the background noises. So one way to mitigate the low quality data would be to properly inform students beforehand. I hope some teachers read this post and take action. While it is great to encourage students to contribute to projects like Common Voice, if it is not done properly it can actually hurt the effort.

Finally, more validators are needed since the Japanese validation progress is currently only 36%.

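(Below is the summary, translated from the original Japanese.)

If any educators who use Common Voice as a school assignment are reading this, I would be grateful if you could make your students aware of the points above.

There is a large amount of unvalidated voice data, so I think people who are about to join should focus on validation rather than (only) recording.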