Problematic sentence in Japanese

I was using the Japanese dataset and encountered sentence_id 2c8dfac7213808b518c19115b70ab325c10a646aeb9fc7593250bc32161237fc, which is 5k characters long. The corresponding audio, 39027804, only covers the first 78 characters. I feel there should be a rule to filter out such length problems, at least from the train set. Thanks for your work.

Hi @joiwfe, welcome and thank you for the report.

I’ll check and report here…

  1. I found the sentence record. It is from 2023 and somehow got past the defenses… It may date from before Sentence Collector was integrated into MCV. I’ll check deeper.
  2. Although the sentence has been recorded (as you say), it does not have a validated recording. It should not be in the `validated.tsv` file. I’ll download the dataset and check.
  3. In any case, that record should be eliminated. We have implemented a QA process in SPS and added some bad-audio elimination to the SPS release process, but we don’t have full QA in SCS yet.

I’ll update with research results…

For now, please make sure to use only the records in `validated.tsv`, and run a pre-processing step to find such outliers. We will be upgrading our processes.
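A minimal sketch of such a pre-processing step, not an official tool: it splits the rows of a tab-separated file into kept rows and outliers by sentence length. The column name `sentence`, the helper name `split_by_sentence_length`, and the 200-character threshold are assumptions for illustration; please check them against the header of your own TSV file and pick a limit that suits the language.

```python
import csv
import io

# Hypothetical threshold: tune it for your language and data.
MAX_CHARS = 200


def split_by_sentence_length(tsv_file, max_chars=MAX_CHARS, column="sentence"):
    """Split TSV rows into (kept, outliers) by the length of one text column.

    `column` is assumed to hold the sentence text; adjust it if your
    file uses a different header.
    """
    # QUOTE_NONE treats quote characters inside sentences as plain text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept, outliers = [], []
    for row in reader:
        text = row.get(column) or ""
        # len() counts Unicode code points, so one kana or kanji counts as one.
        if len(text) > max_chars:
            outliers.append(row)
        else:
            kept.append(row)
    return kept, outliers


# Tiny in-memory demo with made-up rows (not real dataset records).
demo = io.StringIO(
    "path\tsentence_id\tsentence\n"
    "a.mp3\ts1\tshort sentence\n"
    "b.mp3\ts2\t" + "x" * 5000 + "\n"
)
demo_kept, demo_outliers = split_by_sentence_length(demo)
print(len(demo_kept), "kept,", len(demo_outliers), "outlier(s)")
```

For a real file, something like `with open("train.tsv", encoding="utf-8", newline="") as f: kept, outliers = split_by_sentence_length(f)` should work. Note that a pure length check does not detect recordings that cover only part of a sentence; it only flags sentences that are implausibly long.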

Thanks for checking. I think the audio is validated, since I’m only using `train.tsv`.