I have been trying to train my own language model with DeepSpeech, but I couldn't get it to work.
I checked my alphabet.txt file several times, but it seems there are unnecessary blank/space characters in my CSV files…
My error is:
```
/content/gdrive/My Drive/cv_tr/data/DeepSpeech
Preprocessing ['/content/gdrive/My Drive/cv_tr/cv-tr-train/cv-tr-train_clean_five.csv']
WARNING:root:frame length (768) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
[warning repeated 8 times]
Traceback (most recent call last):
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 31, in label_from_string
    return self._str_to_label[string]
KeyError: ' '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 940, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 892, in main
    train()
  File "DeepSpeech.py", line 392, in train
    hdf5_cache_path=FLAGS.train_cached_features_path)
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/preprocess.py", line 68, in preprocess
    out_data = pmap(step_fn, source_data.iterrows())
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/preprocess.py", line 13, in pmap
    results = pool.map(fun, iterable)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 288, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/preprocess.py", line 23, in process_single_file
    transcript = text_to_char_array(file.transcript, alphabet)
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 56, in text_to_char_array
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 56, in <listcomp>
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 35, in label_from_string
    ).with_traceback(e.__traceback__)
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 31, in label_from_string
    return self._str_to_label[string]
KeyError: 'ERROR: Your transcripts contain characters which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your {train,dev,test}.csv transcripts, and then add all these to data/alphabet.txt.'
```
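As far as I understand the traceback, DeepSpeech maps every transcript character to an integer label through a dict built from alphabet.txt, and any character missing from that file raises a `KeyError` — here it is the space `' '`. A minimal sketch of that lookup (the function and variable names below are illustrative, not DeepSpeech's actual API):

```python
def build_str_to_label(alphabet_lines):
    """Build a char -> integer-label mapping from alphabet.txt lines,
    skipping comment lines that start with '#'."""
    labels = [ln for ln in alphabet_lines if not ln.startswith("#")]
    return {ch: i for i, ch in enumerate(labels)}

str_to_label = build_str_to_label(["a", "b", "c"])
print(str_to_label["a"])  # 0

try:
    str_to_label[" "]  # the space character is not in this alphabet
except KeyError:
    print("KeyError: ' ' -- the alphabet has no entry for the space character")
```

Note that spaces between words are normal in transcripts, so the alphabet usually needs a line containing a single space as one of its labels.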
How can I clean my CSV files of these unnecessary space/blank characters?
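For normalizing the transcripts, something like the following might work — a sketch that trims leading/trailing whitespace and collapses internal runs of whitespace in the `transcript` column of a DeepSpeech-style CSV (`wav_filename,wav_filesize,transcript`). The file paths are placeholders:

```python
import csv
import re

def clean_transcript(text):
    """Trim the ends and collapse internal whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def clean_csv(src_path, dst_path):
    """Rewrite a DeepSpeech CSV with normalized transcript whitespace."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["transcript"] = clean_transcript(row["transcript"])
            writer.writerow(row)
```

This only removes *extra* whitespace; the single spaces between words stay, so they still need a matching space entry in alphabet.txt.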
Also, here is my alphabet.txt:

```
# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
a
X
5
6
*
A
Y
9
4
i
P
ç
q
t
â
C
ö
J
K
m
U
O
ü
7
r
.
L
y
a
R
Ç
e
k
V
F
N
w
D
_
s
d
```
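A quick sanity check on the alphabet might also help: the listing above appears to contain `a` twice, and it has no line for the space character. A small sketch that reports duplicate labels and any transcript characters missing from the alphabet (paths and inputs are placeholders):

```python
from collections import Counter

def load_alphabet(lines):
    """Return label lines from alphabet.txt, skipping '#' comments."""
    return [ln for ln in lines if not ln.startswith("#")]

def duplicates(labels):
    """Labels that appear more than once in the alphabet."""
    return sorted(ch for ch, n in Counter(labels).items() if n > 1)

def missing_chars(transcripts, labels):
    """Characters used in transcripts but absent from the alphabet."""
    return sorted(set("".join(transcripts)) - set(labels))

labels = load_alphabet(["a", "b", "a"])
print(duplicates(labels))               # ['a']
print(missing_chars(["ab c"], labels))  # [' ', 'c']
```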
Thanks in advance.