Getting res="" when training Common Voice Tamil data

Hi, I am trying to train on the Common Voice Tamil data using DeepSpeech.
I am using an Nvidia T4 GPU. Training completes without any errors, but I am getting res="" for every entry in the test set. What am I doing wrong?

Training Code:

./DeepSpeech.py --train_files /data/ta/clips/train.csv --dev_files /data/ta/clips/dev.csv --test_files /data/ta/clips/test.csv --epochs 30 --utf8 true --train_batch_size 30 --test_batch_size 10 --dev_batch_size 10 --test_output_file ../test_output/text_results.txt --summary_dir ../model_summary_tm/ --export_dir ../exported_model_tm/

Test Results:

Test on /data/ta/clips/test.csv - WER: 1.000000, CER: 1.000000, loss: 170.356705
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 438.392761
- wav: file:///data/ta/clips/common_voice_ta_19341627.wav
- src: "எப்பொருள் யார்யார்வாய்க் கேட்பினும் அப்பொருள் மெய்ப்பொருள் காண்பதறிவு"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 390.568909
- wav: file:///data/ta/clips/common_voice_ta_19137807.wav
- src: "தீயினால் சுட்ட புண் உள்ளாறும் ஆறாதே நாவினால் சுட்ட வடு"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 382.037537
- wav: file:///data/ta/clips/common_voice_ta_19683442.wav
- src: "எனைத்திட்பம் எய்தியக் கண்ணும் வினைத்திட்பம் வேண்டாரை வேண்டாது உலகு"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 376.565704
- wav: file:///data/ta/clips/common_voice_ta_19294243.wav
- src: "ஓரிடத்தில் நிலவும் முப்பது ஆண்டுகளுக்கான சராசரி வானிலையே 'காலநிலை' எனப்படு
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 364.905273
- wav: file:///data/ta/clips/common_voice_ta_19140270.wav
- src: "'தமிழ் மறவன் பட்டாம்பூச்சி' தமிழக அரசின் சின்னமாக அறிவிக்கப்பட்டுள்ளது"
- res: ""
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 167.159027
- wav: file:///data/ta/clips/common_voice_ta_19340120.wav
- src: "வல்லமை கேட்டிருந்தால் அதைக் கூறாய்"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 167.120743
- wav: file:///data/ta/clips/common_voice_ta_19345236.wav
- src: "தோன்றிற்று மங்கை தூக்கம் நீங்காது"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 167.071304
- wav: file:///data/ta/clips/common_voice_ta_19140179.wav
- src: "அதிவிரைவில் நீர்நிரப ராதி என்ப"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 166.989258
- wav: file:///data/ta/clips/common_voice_ta_19083960.wav
- src: "வஞ்சி கவனித்தாள் சத்தம் வரும்வழியாய்"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 166.886429
- wav: file:///data/ta/clips/common_voice_ta_19816059.wav
- src: "எனைஇழந்தேன் உன்னெழிலில் கலந்த தாலே"
- res: ""
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 65.682457
- wav: file:///data/ta/clips/common_voice_ta_19423203.wav
- src: "பீடு பெற நில்"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 64.212181
- wav: file:///data/ta/clips/common_voice_ta_19340349.wav
- src: "மிக்க நன்றி"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 60.833313
- wav: file:///data/ta/clips/common_voice_ta_19422359.wav
- src: "ஒரே சிரிப்பு"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 44.398792
- wav: file:///data/ta/clips/common_voice_ta_19422346.wav
- src: "உழைப்பு"
- res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 41.551353
- wav: file:///data/ta/clips/common_voice_ta_19340193.wav
- src: "இயற்கை"
- res: ""
--------------------------------------------------------------------------------

Regards,
Tushar

@lissyx might know better, but I would start by testing a small English dataset and see whether that works. I seem to remember the T4 being untested, but that might have changed. If that works, move on to Tamil. Maybe start with a smaller subset of characters and broaden it more and more. It could also be an alphabet or language model problem?

And if you still run into problems, please state your version, general setup, the size of your data sets and maybe some console output :slight_smile: That makes it much simpler to help you.

Thanks! Will test with English data and get back.

Hi, while training on English data, below is an error I keep getting when I try to train without the --utf8 true flag in my training command. I am not sure, but I suspect it is related to the res="" issue I mentioned above. Any idea why I keep getting this error without the --utf8 true flag when my alphabet.txt file is fine?

I am using the Common Voice Tamil data, which is roughly 150 MB and 7 hours of audio. I am also using the following package versions:
deepspeech 0.6.1
tensorflow-gpu 1.15.0
ds-ctcdecoder 0.7.0a2

My alphabet.txt file:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ஒ
ரு
நா
ளை
இ
ன்
ப
ம்
பொ
று
த்
தா
ர்
க்
கு
ப்
பொ
ன்
று
ந்
து
ணை
யு
ம்
பு
க
ழ்
ி
ழ
வ
ர
்
ா
ய
ல
ை
ஂ
ஃ
அ
ஆ
இ
ஈ
உ
ஊ
எ
ஏ
ஐ
ஒ
ஓ
ஔ
க
ங
ச
ஜ
ஞ
ட
ண
த
ந
ன
ப
ம
ய
ர
ற
ல
ள
ழ
வ
ஶ
ஷ
ஸ
ஹ
ா
ி
ீ
ு
ூ
ெ
ே
ை
ொ
ோ
ௌ
்
ௐ
ௗ
0
௧
௨
௩
௪
௫
௬
௭
௮
௯
௰
௱
௲
௳
௴
௵
௶
௷
௸
௹
௺
'
# The last (non-comment) line needs to end with a newline.
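As an aside, the alphabet above contains duplicate entries (for example ஒ, இ and ம் each appear twice) and mixes multi-codepoint syllables such as ரு (ர + ு) with base characters and combining signs, which inflates and muddles the label set. A quick sanity check along these lines can flag both; this is just a sketch, not DeepSpeech's own tooling:

```python
def check_alphabet(lines):
    """Flag duplicate labels and multi-codepoint labels among an
    alphabet file's non-comment lines."""
    labels = [line for line in lines if line and not line.startswith("#")]
    seen, dups = set(), []
    for label in labels:
        if label in seen:
            dups.append(label)
        seen.add(label)
    # Labels made of more than one Unicode codepoint, e.g. consonant + vowel sign.
    multi = [label for label in labels if len(label) > 1]
    return dups, multi

# Tiny excerpt of the alphabet above: க appears twice, ரு is two codepoints.
dups, multi = check_alphabet(["அ", "க", "ரு", "க", "்"])
# dups == ["க"], multi == ["ரு"]
```

Whether you want syllables or base characters as labels is a design choice, but the same label should not appear twice.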

My import code (runs successfully):

bin/import_cv2.py --filter_alphabet data/alphabet.txt /data/ta

My Training Code (no --utf8 flag):

./DeepSpeech.py --train_files /data/ta/clips/train.csv --dev_files /data/ta/clips/dev.csv --test_files /data/ta/clips/test.csv --epochs 30 --train_batch_size 30 --test_batch_size 10 --dev_batch_size 10 --test_output_file ../test_output/text_results.txt --summary_dir ../model_summary_tm/ --export_dir ../exported_model_tm/

Console Output/Error:

I Loading best validating checkpoint from /home/ubuntu/.local/share/deepspeech/checkpoints/best_dev-36999
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
Traceback (most recent call last):
  File "./DeepSpeech.py", line 900, in <module>
    absl.app.run(main)
  File "/home/ubuntu/deepspeech-train-venv/lib/python3.5/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/ubuntu/deepspeech-train-venv/lib/python3.5/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "./DeepSpeech.py", line 873, in main
    train()
  File "./DeepSpeech.py", line 505, in train
    load_or_init_graph(session, method_order)
  File "/home/ubuntu/DeepSpeech/util/checkpoints.py", line 103, in load_or_init_graph
    return _load_checkpoint(session, ckpt_path)
  File "/home/ubuntu/DeepSpeech/util/checkpoints.py", line 70, in _load_checkpoint
    v.load(ckpt.get_tensor(v.op.name), session=session)
  File "/home/ubuntu/deepspeech-train-venv/lib/python3.5/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/deepspeech-train-venv/lib/python3.5/site-packages/tensorflow_core/python/ops/variables.py", line 1033, in load
    session.run(self.initializer, {self.initializer.inputs[1]: value})
  File "/home/ubuntu/deepspeech-train-venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ubuntu/deepspeech-train-venv/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1156, in _run
    (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (256,) for Tensor 'layer_6/bias/Initializer/zeros:0', which has shape '(137,)'

You should only be using that flag if you know what you are doing.

This is not a training dependency, please remove.

I see English and non-English characters in there, is that expected?

It looks like you are re-loading a previous checkpoint, and the shapes are incompatible. Did you change the alphabet between the two runs?
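For context on the 137-vs-256 mismatch: DeepSpeech's output layer has one unit per alphabet label plus one for the CTC blank, and in UTF-8/bytes mode the label set is the byte values, giving a fixed width of 256. So a checkpoint trained with --utf8 true cannot be loaded into a graph built from a custom alphabet.txt. A rough sketch of the arithmetic (the 136-label count here is an assumption read off the error message, not taken from your file):

```python
# Output-layer width in DeepSpeech: number of labels + 1 for the CTC blank.
def output_dim(num_labels):
    return num_labels + 1

utf8_mode_dim = output_dim(255)  # bytes mode: 255 byte values + blank = 256
alphabet_dim = output_dim(136)   # hypothetical 136-entry alphabet.txt = 137

# 256 vs 137: exactly the two shapes in the ValueError above.
```

Hence either delete the old checkpoints or point --checkpoint_dir at a fresh directory when the alphabet changes.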


And try to stick to one DeepSpeech version, 0.6 or 0.7; quite a lot changes in between, as it is in active development. If I were starting, I would probably go for the 0.6 branch, as you will find more docs, issues and threads for it here :slight_smile:
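The package list above already mixes lines (deepspeech 0.6.1 with ds-ctcdecoder 0.7.0a2). A quick consistency check like the following can catch that; the helper is hypothetical and works on a name-to-version mapping you could fill from pip freeze:

```python
def mismatched(versions, reference="0.6"):
    """Return the packages whose version does not start with the
    expected major.minor prefix."""
    return [name for name, ver in versions.items()
            if not ver.startswith(reference)]

installed = {
    "deepspeech": "0.6.1",
    "ds-ctcdecoder": "0.7.0a2",
}
# mismatched(installed) flags ds-ctcdecoder: 0.7.0a2 is not on the 0.6 line.
```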


That, and the currently published 0.7.0a2 artifacts are a bit old now; we should get a newer alpha out somehow :smiley:

You are right, the issue was related to incompatible shapes. Once I cleared the checkpoints folder, there was no error, but the test results were still empty. English characters are not expected for the Tamil data, so I removed them from the alphabet and trained again. The result was still empty.

Also, I ran training on a very small subset of the Common Voice English data: 150 audio files with 3 minutes of voice data. With the English data I was able to produce some output, but it is still producing nothing with the Tamil data.

Below is the console output after training the English data:

Test on /data/en/clips/test.csv - WER: 1.000000, CER: 0.808008, loss: 189.738403
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 0.875000, CER: 0.794872, loss: 266.495209
 - wav: file:///data/en/clips/common_voice_en_3096.wav
 - src: "so i started making stuff up just by combining the things that i saw around me"
 - res: "i i i i i i i i i i i "
--------------------------------------------------------------------------------
WER: 0.888889, CER: 0.755102, loss: 200.038803
 - wav: file:///data/en/clips/common_voice_en_3309.wav
 - src: "a girls stands by a caricature artist 's drawings"
 - res: "i assassinated in a "
--------------------------------------------------------------------------------
WER: 0.909091, CER: 0.866667, loss: 156.069107
 - wav: file:///data/en/clips/common_voice_en_3308.wav
 - src: "a man in a gray shirt is holding a metal pipe"
 - res: "i see a "
--------------------------------------------------------------------------------
WER: 0.923077, CER: 0.838710, loss: 214.126221
 - wav: file:///data/en/clips/common_voice_en_3290.wav
 - src: "a man with a can walks past a painting of a construction scene"
 - res: "i see it is a "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.800000, loss: 211.179337
 - wav: file:///data/en/clips/common_voice_en_3305.wav
 - src: "man sits among bicycles while adjusting the tire on one"
 - res: "i see it is a "
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.909091, loss: 183.076401
 - wav: file:///data/en/clips/common_voice_en_3307.wav
 - src: "two people are playing golf on a golf course"
 - res: "i assented "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.895833, loss: 180.843323
 - wav: file:///data/en/clips/common_voice_en_3306.wav
 - src: "a small black and white dog is swimming in water"
 - res: "i seen "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.886364, loss: 170.807999
 - wav: file:///data/en/clips/common_voice_en_3316.wav
 - src: "these young people are relaxing on the grass"
 - res: "i seen "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.863636, loss: 165.001678
 - wav: file:///data/en/clips/common_voice_en_3291.wav
 - src: "two ladies running in front of a cocacola ad"
 - res: "i assented "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.800000, loss: 163.207550
 - wav: file:///data/en/clips/common_voice_en_3293.wav
 - src: "a man sits on an indoor bench and smiles"
 - res: "i assented to "
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 1.076923, CER: 0.772727, loss: 272.909973
 - wav: file:///data/en/clips/common_voice_en_641.wav
 - src: "add black yankee rock played by pop punk's not dead in my playlist"
 - res: "e e e e e e e e e e e e e e "
--------------------------------------------------------------------------------
WER: 1.222222, CER: 0.711538, loss: 196.695816
 - wav: file:///data/en/clips/common_voice_en_3130.wav
 - src: "book spot at a highly rated restaurant in tajikistan"
 - res: "a a a a a a a a a a a a "
--------------------------------------------------------------------------------
WER: 1.333333, CER: 0.735849, loss: 212.640320
 - wav: file:///data/en/clips/common_voice_en_3292.wav
 - src: "a group of men in an office looking at a large screen"
 - res: "i i i i i i i i i i i i i i i i "
--------------------------------------------------------------------------------
WER: 1.400000, CER: 0.714286, loss: 135.872665
 - wav: file:///data/en/clips/common_voice_en_3295.wav
 - src: "woman posing with a painting"
 - res: "i i i i i i i "
--------------------------------------------------------------------------------
WER: 1.625000, CER: 0.808511, loss: 207.264389
 - wav: file:///data/en/clips/common_voice_en_3310.wav
 - src: "a group of five female children reason outdoors"
 - res: "i i i i i i i i i i i i i "
--------------------------------------------------------------------------------
I Exporting the model...

How much Tamil data do you have? Do you train from scratch?

I have 135 MB of data, just a little over 1 hour of training audio. I am training from scratch, following each step mentioned here: https://github.com/mozilla/DeepSpeech/blob/master/doc/TRAINING.rst#training-your-own-model

I installed the packages mentioned, imported the data, edited the alphabet.txt file and started the training. Is there any other step in between?


With one hour of data, you can’t really train anything at all, it’s not surprising you get "".

Yes, I read that in a few threads on this forum, but my English data was only around 3 minutes and it still produced some output.

Yes, but you are not sharing the whole picture of your training: model size, epochs, alphabet size, etc.

I tried with the default 75 epochs first, then reduced it to 30. The alphabet.txt file contains more than 100 Tamil characters, and my exported model (.pb) is 180 MB.

Also, if I use the training data as the test data, should I expect some output then?

So that's a much more complicated model than the English one.

Hard to tell.

I'm not sure I understand what you are chasing here, when you obviously don't have enough data to do anything useful. Can you explain more?

I am trying to build speech-to-text for some regional languages of India, for example Hindi, Tamil and Malayalam. Since the Common Voice Tamil data was readily available, I am trying to get some output from it and check the accuracy first, as generating substantial training data for each language is a huge task. Would you be able to try training the Tamil data once at your end and see if it produces any output?

I also tried with the Common Voice Irish dataset, which is also around 1 hour of data. I only had four extra characters to add to the alphabet file, but even for that the result was empty.


No, I don’t have time for that.

With just one hour, you are not going to be able to check anything valuable.


Well, again, a very small dataset. Push the epochs much higher to force overfitting, but that is not going to give you any valuable insight anyway.

Please triple-check your language model, trie and your ds_ctcdecoder setup as well.
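For reference, on the 0.6 branch the decoder inputs are built from a plain text corpus: KenLM's lmplz and build_binary produce the language model, and the generate_trie tool from the native client produces the trie from the alphabet and the binary LM. Whatever corpus you feed lmplz has to match the alphabet. A sketch of preparing such a corpus from the import CSVs (the transcript column name is what import_cv2.py is expected to emit, but verify it against your files):

```python
import csv

def write_lm_corpus(csv_paths, out_path, alphabet):
    """Collect transcripts from DeepSpeech-style CSVs into one text
    file for lmplz, dropping characters outside the alphabet."""
    allowed = set(alphabet) | {" "}
    with open(out_path, "w", encoding="utf-8") as out:
        for path in csv_paths:
            with open(path, encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    # Keep only characters the acoustic model can emit.
                    text = "".join(c for c in row["transcript"] if c in allowed)
                    text = " ".join(text.split())  # normalize whitespace
                    if text:
                        out.write(text + "\n")
```

The resulting file would then go through lmplz, build_binary and generate_trie as described in the 0.6 docs; this snippet is just the corpus-preparation step, not the whole pipeline.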


Sure! Do I need to explicitly create the language model and trie? If yes, can you please point me to a doc I can refer to?