Get WER for entire test set

Hello team,

Is there a way to get the predicted transcripts for the entire test set, along with the WER? At present (v0.60) it shows the WER only for a few (~10) test transcripts.
Kindly guide.

The first line, before the samples are shown, is the WER/CER for the entire set.

Adding to Reuben's answer: I usually run it with just the test_files param and set

--test_output_file "/xxx/out.txt" \
--report_count 50 \

If you want to dig deeper, try the benchmarkstt repo on the out.txt file.
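For context, a complete evaluation call might look roughly like this (the paths are placeholders, and you'd add your usual alphabet/LM flags for your setup):

python3 DeepSpeech.py \
--test_files /path/to/test.csv \
--checkpoint_dir /path/to/checkpoints \
--test_output_file "/xxx/out.txt" \
--report_count 50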

@reuben: I meant that I want the predicted transcripts for the entire test set, with the WER for each transcript. Is there a way to get that? Kindly guide.

@othiele: Thank you, will try it out.

Either set report_count to the size of your test set, or check the output file; it lists results for all inputs.

Thank you @othiele, it worked. But the file is written in plain ASCII by default. I tried

iconv -f ASCII -t UTF-8 out.txt > "out_utf8.txt"

but it didn’t work. How did you handle it?

{
	"char_distance": 35,
	"word_length": 3,
	"wer": 2.6666666666666665,
	"char_length": 26,
	"loss": 208.03085327148438,
	"src": "beim vorliegenden gesch\u00e4ft",
	"word_distance": 8,
	"wav_filename": "/media/data/LTLab.lan/agarwal/german-speech-corpus/swiss_german/clips/35795.wav",
	"res": "die kollegen und kolleginnen wie marie den elfte",
	"cer": 1.3461538461538463
}

It’s this problem: https://stackoverflow.com/a/18337754/346048

I’ll make a PR for a fix. Alternatively you can load the file using Python and apply a fix locally so you don’t have to re-run the test epoch.
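Something like this should do it (a minimal sketch, assuming out.txt is a single JSON array of entries like the excerpt above). Note that iconv can't help here: the \u00e4 sequences are already valid ASCII, they are JSON escapes rather than a broken encoding.

import json

# Load the test report; json.load turns the \uXXXX escapes back into real characters.
with open("out.txt", "r", encoding="utf-8") as fin:
    samples = json.load(fin)

# Write it back without re-escaping non-ASCII characters.
with open("out_utf8.txt", "w", encoding="utf-8") as fout:
    json.dump(samples, fout, ensure_ascii=False, indent=4)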

@reuben:

In the test results, I see a problem. Some resulting transcripts are very short (1-2 words) and some are very long (15-20 words) for source transcripts of 5-8 words.

I tried changing the lm_alpha and lm_beta parameters, but without much success. Do you recommend anything to solve this problem?

Note: I am working on German data and have trained the model with ~1000 Hours of speech data.
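For reference, the kind of sweep I tried looks roughly like this (just a sketch; lm_alpha and lm_beta are the v0.6 flag names, and the paths are placeholders):

for alpha in 0.5 0.75 1.0; do
    python3 DeepSpeech.py \
        --test_files /path/to/test.csv \
        --checkpoint_dir /path/to/checkpoints \
        --lm_alpha $alpha \
        --lm_beta 1.85 \
        --test_output_file /xxx/out_alpha_${alpha}.txt
done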

@reuben, Kindly advise on it.

Others will be better qualified than I to comment, but I’d guess it’s worth looking at two areas:

1. Your dataset: what’s the audio quality like? And how about the transcription quality? Bear in mind that 0.70 was trained on maybe six times as much audio as your 1,000 hrs (just a back-of-the-envelope calculation based on the datasets mentioned on the release page under the Training Regimen section).

2. Your language model/scorer: how large a text corpus did you use to create it? Was it just the transcribed text from your audio dataset, or was it more comprehensive? Unless you’re targeting a narrow-vocabulary scenario (and it doesn’t sound like you are), you’ll likely want the biggest corpus you can manage, so that the model makes sensible predictions about sentence probabilities.

@reuben, Kindly advise on it.

Given it wasn’t that long after your earlier question and people had already helped you, it might be worth a little patience :slightly_smiling_face: People are often happy to help but they aren’t sitting there just waiting for your next question… :wink:

Anyway, I hope you get to the bottom of your issues with the transcriptions.
