Training Error due to file formatting

sanjay.pandey · July 31, 2019, 6:25am

Hello @reuben @lissyx I hope you are doing fine.

While training my model, after some time I receive the error “ValueError: File format b’\x1aE\xdf\xa3’… not understood.”
I understand this points to a WAV file formatting problem. I use FFmpeg to convert my files to 16 kHz every time and it has always worked, but this is the first time I have received this error. I am also unable to find which particular WAV file is causing the problem. Please help!

alchemi5t · July 31, 2019, 8:04am

Check your train/test/dev CSV files. They are probably missing the header row:

wav_filename,wav_filesize,transcript
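A minimal sketch of how that header check could be automated, assuming the CSV files are UTF-8 encoded. The function name `has_expected_header` is hypothetical and not part of DeepSpeech.

```python
import csv

# Header row quoted in the post above.
EXPECTED_HEADER = ["wav_filename", "wav_filesize", "transcript"]


def has_expected_header(csv_path):
    """Return True if the first row of the CSV is the expected header.

    "utf-8-sig" is used so that a leading byte-order mark, if present,
    does not make an otherwise correct header look wrong.
    """
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        first_row = next(csv.reader(f), None)
    return first_row == EXPECTED_HEADER
```

Calling it once for each of the train, dev and test CSV files would confirm or rule out this cause.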

sanjay.pandey · July 31, 2019, 8:20am

No, there is no such issue. It is related to file formatting, or maybe a corrupted file.

eggonlea · July 31, 2019, 4:49pm

@sanjay.pandey you’re the only person who has access to the wav files. I doubt anybody else can actually help you find it.

I’d suggest you search your dataset for the byte sequence b’\x1aE\xdf\xa3’ from the error message.
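One way such a search could look, as a rough sketch: it assumes the offending bytes sit at the very start of a file, which the "File format ... not understood" wording suggests but does not prove. The function name `find_files_with_magic` and the directory argument are made up for illustration.

```python
from pathlib import Path

# The four bytes quoted in the error message (hex 1A 45 DF A3).
MAGIC = b"\x1aE\xdf\xa3"


def find_files_with_magic(root, magic=MAGIC):
    """Return every file under `root` whose first bytes equal `magic`."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open("rb") as f:
                if f.read(len(magic)) == magic:
                    matches.append(path)
    return matches
```

Running it on the directory that holds the audio files lists the candidates regardless of their file extension.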

dabinat · July 31, 2019, 9:09pm

It’s probably some additional characters in the CSV file that may not be visible when viewed as ASCII text.

An advanced text editor should have the ability to strip this out. I use TextWrangler on the Mac for this (the Zap Gremlins option); I’m not sure what the equivalent Windows or Linux app would be.

sanjay.pandey · August 14, 2019, 5:46am

Hello @reuben @lissyx hope you are doing well.
Can you please help with this? I have been stuck on it for a week! Please help!

lissyx · August 14, 2019, 6:52am

You have already been given as much help as we can give. There’s something bogus in your data, and we can’t find it for you.

sanjay.pandey · August 14, 2019, 8:06am

I have listened to every file and everything sounds good. I then ran every file through Google Speech-to-Text to check for audio file errors, but it did not show any. The help I got above was of no use when I tried it.
Also, the error does not say which row or file is causing it, so I am not able to solve it. Here is the screenshot.

lissyx · August 14, 2019, 8:09am

Your reply is also of no use. Don’t use screenshots. I can’t read your error.

Read what has been replied: corruption does not mean that some tools cannot read your files.

Fix your dataset.

lissyx · August 14, 2019, 8:12am

Help yourself: preprocess.py has file.wav_filename so you can just dump that and see what file is bogus …
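The same idea can be tried outside of preprocess.py. The sketch below is a standalone approximation and not DeepSpeech code: it walks the wav_filename column of a CSV and tries to open each file with Python's standard `wave` module, collecting the names that fail. The helper name `find_unreadable_wavs` is hypothetical, and the `wave` module may accept or reject slightly different files than the reader used during training.

```python
import csv
import wave


def find_unreadable_wavs(csv_path):
    """Return (filename, error text) pairs for rows whose file fails to open as WAVE."""
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row["wav_filename"]
            try:
                # Opening the file is enough to parse and validate the header.
                with wave.open(name, "rb"):
                    pass
            except (wave.Error, EOFError, OSError) as err:
                bad.append((name, str(err)))
    return bad
```

Printing the returned list for the train, dev and test CSV files should narrow the problem down to specific rows.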

reyxuan · August 14, 2019, 8:24am

In many cases these errors are produced when mixing UNIX and Windows file formats (line endings). Try using dos2unix.

reuben · August 14, 2019, 8:35am

The header in the error message (1A 45 DF A3) means the file is a Matroska container file (.mkv, .mka, .webm, etc.). Check whether all your training files are in the proper format using a tool like file or soxi. Training files for DeepSpeech should be WAVE audio, signed 16-bit PCM, mono, 16000 Hz.
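For anyone who prefers a script to file or soxi, the format requirements above could be checked roughly like this. It is a sketch built on Python's standard `wave` module; the function name `wav_format_problems` is invented for illustration, and files that the `wave` module cannot parse (such as a Matroska container) are simply reported as unreadable.

```python
import wave


def wav_format_problems(path):
    """List the ways `path` differs from mono, 16-bit PCM, 16000 Hz WAVE."""
    try:
        with wave.open(str(path), "rb") as w:
            channels = w.getnchannels()
            sample_width = w.getsampwidth()  # bytes per sample
            rate = w.getframerate()
    except (wave.Error, EOFError) as err:
        # Raised when the file is not a RIFF/WAVE file the module understands.
        return [f"not readable as PCM WAVE: {err}"]

    problems = []
    if channels != 1:
        problems.append(f"expected 1 channel, found {channels}")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate != 16000:
        problems.append(f"expected 16000 Hz, found {rate} Hz")
    return problems
```

An empty list means the file matches the expected format; anything else names what to fix when re-encoding.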
