KeyError: ' ' while training with my own data

I have been trying to train my own language model with DeepSpeech, but I couldn’t make it work.

I checked my alphabet.txt file several times, but it seems there are unnecessary blank/space characters in my csv files…

My error is:

/content/gdrive/My Drive/cv_tr/data/DeepSpeech
Preprocessing ['/content/gdrive/My Drive/cv_tr/cv-tr-train/cv-tr-train_clean_five.csv']
WARNING:root:frame length (768) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
(the same warning repeats several times)
Traceback (most recent call last):
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 31, in label_from_string
    return self._str_to_label[string]
KeyError: ' '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 940, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 892, in main
    train()
  File "DeepSpeech.py", line 392, in train
    hdf5_cache_path=FLAGS.train_cached_features_path)
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/preprocess.py", line 68, in preprocess
    out_data = pmap(step_fn, source_data.iterrows())
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/preprocess.py", line 13, in pmap
    results = pool.map(fun, iterable)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 288, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/preprocess.py", line 23, in process_single_file
    transcript = text_to_char_array(file.transcript, alphabet)
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 56, in text_to_char_array
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 56, in <listcomp>
    return np.asarray([alphabet.label_from_string(c) for c in original])
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 35, in label_from_string
    ).with_traceback(e.__traceback__)
  File "/content/gdrive/My Drive/cv_tr/data/DeepSpeech/util/text.py", line 31, in label_from_string
    return self._str_to_label[string]
KeyError: 'ERROR: Your transcripts contain characters which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your {train,dev,test}.csv transcripts, and then add all these to data/alphabet.txt.'
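For context, the mapping that fails looks roughly like this. This is a simplified sketch, not the actual `util/text.py`: the real class reads alphabet.txt from disk, but the failure mode is the same dictionary lookup.

```python
class Alphabet:
    # Simplified sketch: each character from alphabet.txt gets a numeric
    # label; looking up a character that is not in the file raises KeyError.
    def __init__(self, chars):
        self._str_to_label = {c: i for i, c in enumerate(chars)}

    def label_from_string(self, string):
        return self._str_to_label[string]

alpha = Alphabet("abc")
print(alpha.label_from_string("b"))  # -> 1
try:
    alpha.label_from_string(" ")     # space missing from the alphabet
except KeyError:
    print("KeyError: ' '")
```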

How can I clean my csv files of these unnecessary space/blank characters?
Here is my alphabet.txt as well:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.

a
X
5
6
*
A
Y
9
4
i
P
ç
q
t
â
C
ö
J
K
m
U
O
ü
7
r
.
L
y
a
R
Ç
e
k
V
F
N
w
D
_
s
d

Thanks in advance.


Your KeyError refers to ' ', but I don’t see any space in your alphabet. Why don’t you add a space?

Which character should I add to my alphabet.txt to add a space?

I don’t know, how about a space?

I tried adding a space and tried ' ', but it didn’t solve it.

Well, check your dataset and find the exact character; it might be a more exotic space somewhere in Unicode. Print its hex value at the place the KeyError arises, or use util/check_characters.py to generate your alphabet.
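A minimal sketch of that idea: scan the transcript column for characters outside your alphabet and print their hex codepoints and names. The alphabet set and the sample row below are made up; the real DeepSpeech CSVs have columns `wav_filename,wav_filesize,transcript`.

```python
import csv
import io
import unicodedata

# Made-up sample standing in for one of the real CSV files; note the
# NO-BREAK SPACE (U+00A0) hiding in the transcript.
sample = "wav_filename,wav_filesize,transcript\nclip1.wav,1000,merhaba\u00a0evren\n"
alphabet = set("abcdefghijklmnopqrstuvwxyz")  # plus your Turkish letters and ' '

unknown = set()
for row in csv.DictReader(io.StringIO(sample)):
    for ch in row["transcript"]:
        if ch not in alphabet:
            unknown.add(ch)

for ch in sorted(unknown):
    print(hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
# -> 0xa0 NO-BREAK SPACE
```

To run it against a real file, replace `io.StringIO(sample)` with `open(path, encoding='utf-8')`.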

I have written a script to update my alphabet.txt file with the characters in my transcripts

train_lst = ['?', 'C', 'p', 'A', 'â', 'U', '7', 'H', 't', 's', '8', 'j', 'w', 'o', 'k', 'Ç', 'Ü', 'I', 'q', 'F', '9', 'ü', 'O', '6', 'z', ' ', 'y', 'h', 'f', 'Q', 'ö', 'u', 'İ', 'G', '.', 'i', '2', 'v', 'b', 'ş', 'x', 'E', 'T', '_', '3', 'P', 'ğ', 'S', 'Ş', 'ç', 'r', '5', 'Y', 'l', 'ı', 'D', 'B', 'c', 'd', '*', 'Ö', 'X', 'W', 'K', 'L', 'n', '0', 'g', '4', 'M', 'V', 'J', 'N', 'm', 'e', '!', '1', 'Z', 'a', 'R']
test_lst = ['ü', 'S', 'L', 'p', 'O', 'R', 'a', 'Ş', 'i', 'g', '3', 'z', 'V', 'A', 'M', 'o', 'd', 'n', '1', '?', 'U', 'x', 'T', 'P', 'E', 'ç', 'c', 'b', 'k', 'u', 'v', 'K', 'l', 'r', 'İ', 'â', 'w', 'h', 'j', '.', 'Y', 'Ö', 'ğ', 't', 'e', 'm', 'Z', ' ', 'ş', 'G', 's', 'D', 'I', 'f', 'ö', 'N', 'C', 'Ü', 'y', 'B', 'H', 'F', 'J', 'ı', 'Ç']
dev_lst = ['m', 'N', 't', 'C', 'â', 'a', 'J', 'b', 'F', 'R', 'P', 'u', ' ', 'B', 'y', 'Z', 'z', 'A', 'k', 'n', 'r', 'v', 'ı', 'D', 'G', 'E', 'O', 'H', 'Ü', 'M', 'c', 'x', 'd', 'L', 'K', 'S', 'W', 'p', 'ş', 'T', 'g', 'h', 'İ', 'w', 'Y', '1', 'l', 'Ş', 's', 'j', 'I', 'U', '3', 'ç', '.', 'V', 'Ö', 'f', 'ğ', '?', 'e', 'Ç', 'i', 'o', '0', 'ü', 'ö']
all_alpha = []
# zip() would stop at the shortest list and silently drop characters,
# so iterate over each list in full instead.
for lst in (train_lst, test_lst, dev_lst):
  for elem in lst:
    if elem not in all_alpha:
      all_alpha.append(elem)
with open('/content/gdrive/My Drive/cv_tr/data/alphabet.txt', 'w') as alphabet:
  for char in all_alpha:
    alphabet.write(char + '\n')
print(all_alpha)

But I still have to find a way to get rid of this space KeyError.
Thank you, I will look for the hex value.
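If it does turn out to be an exotic space, one way to clean the transcripts is to collapse all Unicode whitespace to a plain ASCII space. This is a sketch to adapt to your own cleaning pipeline:

```python
import re

def normalize_spaces(text):
    # In Python 3, \s matches Unicode whitespace, including NO-BREAK SPACE
    # (U+00A0), so this collapses any whitespace run to one plain space.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_spaces("merhaba\u00a0evren"))  # -> merhaba evren
```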

Why not use util/check_characters.py?

I collected the results for all of my csv files (train, dev, test) with

%cd util
!python3 check_characters.py -csv /content/gdrive/My\ Drive/cv_tr/cv-tr-train/cv-tr-train_clean_six.csv

!python3 check_characters.py -csv /content/gdrive/My\ Drive/cv_tr/cv-tr-dev/cv-tr-dev_clean_six.csv

!python3 check_characters.py -csv /content/gdrive/My\ Drive/cv_tr/cv-tr-test/cv-tr-test_clean_six.csv

And with the script above I added them to my alphabet.txt file.

That’s still a source of error. Please check the script’s help: you can pass it all three CSVs at once and generate a valid alphabet.txt from them. Much less risk.

Thank you lissyx,
My problem is solved.
I copied the stock alphabet.txt from the DeepSpeech folder you get after git clone, and then added my missing characters to it by hand.
