I have created a speech dataset to train with DeepSpeech while following this tutorial: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
But I couldn't train my dataset with DeepSpeech. Running the training command:
python DeepSpeech.py --train_files /mnt/c/wsl/teneke_out_bolum1/
gives this error:
pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
I created the dataset after forced alignment with aeneas and fine-tuning with finetuneas.
Here is my code that I used on Google Colab to train with DeepSpeech:
I found some suggested solutions on Google, such as:
data = pd.read_csv('file1.csv', error_bad_lines=False)
Also, as the error output itself suggests, I might solve it by setting
engine='python'
But I couldn't figure out where I should make this change.
So, where should I edit to fix this issue?
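To show what I mean, here is a minimal, self-contained sketch of the engine switch as I understand it. The sample CSV below is a hypothetical stand-in following the DeepSpeech manifest column format; where exactly `pandas.read_csv` is called inside the DeepSpeech sources is what I can't locate:

```python
import io
import pandas as pd

# Hypothetical stand-in for one DeepSpeech manifest CSV; the real files
# are the ones passed to --train_files.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "clip_0001.wav,48000,hello world\n"
)

# engine='python' switches from the default C parser (which raised
# "Calling read(nbytes) on source failed" for me) to the slower but
# more tolerant pure-Python parser.
df = pd.read_csv(sample, engine='python')
print(df)
```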
Thanks.