I created a speech dataset for training DeepSpeech by following this tutorial: Creating an open speech recognition dataset for (almost) any language | by Andreas Klintberg | Medium.
But I couldn't train on my dataset with DeepSpeech. Running the train command:
python DeepSpeech.py --train_files /mnt/c/wsl/teneke_out_bolum1/
gives this error:
pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
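A quick way to check whether the CSV itself is the problem is to load it directly with pandas, outside DeepSpeech (a minimal sketch; train.csv is just my guess at the file name the tutorial produces, yours may differ):

import pandas as pd

# Parse the dataset CSV with pandas' default C engine; this should
# raise the same ParserError if the file itself is malformed.
df = pd.read_csv('/mnt/c/wsl/teneke_out_bolum1/train.csv', encoding='utf-8')
print(df.head())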
I created the dataset after aeneas forced alignment and fine-tuning with finetuneas:
Here is the code I used on Google Colab to train with DeepSpeech:
I found some suggested solutions on Google, such as:
data = pd.read_csv('file1.csv', error_bad_lines=False)
The error output itself also suggests that setting engine='python' might solve it.
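Before touching DeepSpeech itself, I could test both suggestions on the file directly (same sketch as above, with the same guessed file name):

import pandas as pd

csv_path = '/mnt/c/wsl/teneke_out_bolum1/train.csv'

# Suggestion 1: skip unparseable rows instead of raising an error.
# (On pandas >= 1.3 this option is spelled on_bad_lines='skip'.)
df_skip = pd.read_csv(csv_path, error_bad_lines=False)

# Suggestion 2: the slower but more tolerant pure-Python parser.
df_py = pd.read_csv(csv_path, engine='python')

print(len(df_skip), 'rows with bad lines skipped')
print(len(df_py), 'rows with the python engine')

Skipping bad lines would silently drop clips from the training data, so engine='python' seems like the safer option if it works.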
But I couldn't figure out where I should make this change in DeepSpeech.
So, where should I edit to fix this issue?
Thanks.