Help with Japanese model

Using the --reverse_train flag immediately causes the program to go OOM. Apparently DeepSpeech sorts the training files by size, so the largest files are at the bottom - https://github.com/mozilla/DeepSpeech/issues/2513 - which means --reverse_train feeds the largest files first.
Anything above a batch size of 4 crashes.
I think the --reverse_train flag would be a useful tip for topics such as "What is the ideal batch size?"
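For anyone hitting the same thing, here is a minimal sketch (assuming the standard wav_filename,wav_filesize,transcript CSV header) that lists the largest samples in a training CSV - the files DeepSpeech's sorting pushes to the end, and the ones --reverse_train surfaces first:

import csv

# Print the ten largest samples in the training CSV. DeepSpeech sorts batches
# by wav_filesize, so these files are what trigger a late-epoch OOM.
with open('final-train.csv', newline='', encoding='utf-8') as f:
    rows = sorted(csv.DictReader(f), key=lambda r: int(r['wav_filesize']), reverse=True)

for row in rows[:10]:
    print(row['wav_filesize'], row['wav_filename'])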

Try a train batch size of 1 - it will take longer, but with --reverse_train you'll see whether it will work at all.

It's working now with a batch size of 4 reversed, and it hasn't crashed - so I am pretty confident it will complete one epoch this time.

Also, I just wanted to confirm that I have converted to UTF-8 like this

ただまごまごするだけであった。夫人はそれを見澄してこういった。「誤解しちゃいけませんよ。私は私、

Into

\xE3\x81\x9F\xE3\x81\xA0\xE3\x81\xBE\xE3\x81\x94\xE3\x81\xBE\xE3\x81\x94\xE3\x81\x99\xE3\x82\x8B\xE3\x81\xA0\xE3\x81\x91\xE3\x81\xA7\xE3\x81\x82\xE3\x81\xA3\xE3\x81\x9F\xE3\x80\x82\xE5\xA4\xAB\xE4\xBA\xBA\xE3\x81\xAF\xE3\x81\x9D\xE3\x82\x8C\xE3\x82\x92\xE8\xA6\x8B\xE6\xBE\x84\xE3\x81\x97\xE3\x81\xA6\xE3\x81\x93\xE3\x81\x86\xE3\x81\x84\xE3\x81\xA3\xE3\x81\x9F\xE3\x80\x82\xE3\x80\x8C\xE8\xAA\xA4\xE8\xA7\xA3\xE3\x81\x97\xE3\x81\xA1\xE3\x82\x83\xE3\x81\x84\xE3\x81\x91\xE3\x81\xBE\xE3\x81\x9B\xE3\x82\x93\xE3\x82\x88\xE3\x80\x82\xE7\xA7\x81\xE3\x81\xAF\xE7\xA7\x81\xE3\x80\x81

So each record in the CSV file looks like this -

/home/anon/Downloads/jaSTTDatasets/processedAudio/19752.wav,100070,\xE7\xB4\xA0\xE6\x99\xB4\xE3\x82\x89\xE3\x81\x97\xE3\x81\x84\xE8\xAA\x95\xE7\x94\x9F\xE6\x97\xA5\xE3\x82\x92\xE8\xBF\x8E\xE3\x81\x88\xE3\x82\x89\xE3\x82\x8C\xE3\x81\xBE\xE3\x81\x99\xE3\x82\x88\xE3\x81\x86\xE3\x81\xAB\xE3\x80\x82

Is this correct?

No - it needs to stay the original way, as actual UTF-8 characters rather than hex escape notation.

Update:

  1. Files causing inf loss -
    This was fixed after converting my transcripts from hex notation back to proper UTF-8 encoded characters (a sketch of the conversion follows this list).
  2. OOM errors -
    Fixed by reducing the batch size to something my GPU could handle, in my case 4.
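For reference, a minimal sketch of that hex-to-UTF-8 conversion (the helper name and the round-trip through latin-1 are my own illustration, not DeepSpeech code):

def unescape_hex_transcript(text):
    # Interpret literal "\xE3\x81\x9F"-style escape text as raw bytes, then
    # decode those bytes as UTF-8 to recover the original characters.
    raw = text.encode('latin-1').decode('unicode_escape').encode('latin-1')
    return raw.decode('utf-8')

print(unescape_hex_transcript(r'\xE3\x81\x9F\xE3\x81\xA0'))  # -> ただ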

I was able to get it to train for 2 epochs successfully; however, I encountered another issue that I have been stuck on.

I wanted to test the model after 2 epochs, but when the tests are run it returns this error -

I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                      Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 958, in main
    test()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 682, in test
    samples = evaluate(FLAGS.test_files.split(','), create_model)
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 132, in evaluate
    samples.extend(run_test(init_op, dataset=csv))
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 114, in run_test
    cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in ctc_beam_search_decoder_batch
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in <listcomp>
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 227, in <listcomp>
    [(res.confidence, alphabet.Decode(res.tokens)) for res in beam_results]
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 138, in Decode
    return res.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

Naturally I assumed it's a UTF-8 issue and that I need to fix my files. However, I have tried almost everything to fix the files and nothing seems to work. Note that they work for training and validation - it is only during testing that this fails.

anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/final-dev.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/final-train.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/final-test.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/new
newLogs.txt  newutf.csv   
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/newutf.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/newutf.csv -o /dev/null; echo $?
0
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/final-test.csv -o /dev/null; echo $?
0
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/final-train.csv -o /dev/null; echo $?
0
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/final-dev.csv -o /dev/null; echo $?
0

As you can see, the files seem to be properly UTF-8 encoded.
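In addition to isutf8 and iconv, a small sketch to pin down a single offending line, in case only one record is bad (path taken from above):

# Report any line of the CSV that is not valid UTF-8, with its line number,
# rather than a single pass/fail verdict for the whole file.
with open('/home/anon/Downloads/jaSTTDatasets/final-test.csv', 'rb') as f:
    for lineno, raw in enumerate(f, 1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            print('line %d: %s' % (lineno, err))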

Here is the test csv file - final-test.zip (349 Bytes)

My Dockerfile, if you're curious about the environment I am building in -

# Please refer to the TRAINING documentation, "Basic Dockerfile for training"

FROM tensorflow/tensorflow:1.15.4-gpu-py3
ENV DEBIAN_FRONTEND=noninteractive

ENV DEEPSPEECH_REPO=https://github.com/mozilla/DeepSpeech.git
ENV DEEPSPEECH_SHA=origin/master

RUN apt-get update && apt-get install -y --no-install-recommends \
        apt-utils \
        bash-completion \
        build-essential \
        cmake \
        curl \
        git \
        libboost-all-dev \
        libbz2-dev \
        locales \
        python3-venv \
        unzip \
        wget

# We need to remove it because it's breaking deepspeech install later with
# weird errors about setuptools
RUN apt-get purge -y python3-xdg

# Install dependencies for audio augmentation
RUN apt-get install -y --no-install-recommends libopus0 libsndfile1

# Try and free some space
RUN rm -rf /var/lib/apt/lists/*

WORKDIR /
RUN git clone $DEEPSPEECH_REPO DeepSpeech

WORKDIR /DeepSpeech
RUN git checkout $DEEPSPEECH_SHA

# Build CTC decoder first, to avoid clashes on incompatible versions upgrades
RUN cd native_client/ctcdecode && make NUM_PROCESSES=$(nproc) bindings
RUN pip3 install --upgrade native_client/ctcdecode/dist/*.whl

# Prepare deps
RUN pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0

# Install DeepSpeech
#  - No need for the decoder since we did it earlier
#  - There is already correct TensorFlow GPU installed on the base image,
#    we don't want to break that
RUN DS_NODECODER=y DS_NOTENSORFLOW=y pip3 install --upgrade -e .

# Tool to convert output graph for inference
RUN python3 util/taskcluster.py --source tensorflow --branch r1.15 \
        --artifact convert_graphdef_memmapped_format  --target .

# Build KenLM to generate new scorers
WORKDIR /DeepSpeech/native_client
RUN rm -rf kenlm && \
	git clone https://github.com/kpu/kenlm && \
	cd kenlm && \
	git checkout 87e85e66c99ceff1fab2500a7c60c01da7315eec && \
	mkdir -p build && \
	cd build && \
	cmake .. && \
	make -j $(nproc)
WORKDIR /DeepSpeech

ENV TF_FORCE_GPU_ALLOW_GROWTH=true

RUN apt-get update
RUN apt-get install vim -y

RUN sed -i 's/tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len)/tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len, ignore_longer_outputs_than_inputs=True)/g' training/deepspeech_training/train.py
RUN sed -i 's/sequence_length=batch_x_len)/sequence_length=batch_x_len, ignore_longer_outputs_than_inputs=True)/g' training/deepspeech_training/evaluate.py
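The two sed lines at the end add ignore_longer_outputs_than_inputs=True to the CTC loss calls, so samples whose transcript is longer than the network's output sequence yield zero loss instead of crashing. A standalone sketch of that flag's effect under the TF 1.15 image above (shapes and values are made up for illustration):

import numpy as np
import tensorflow.compat.v1 as tfv1

# One sample with only 2 time steps but a 3-symbol label: the label is longer
# than the output, which would normally raise an error inside ctc_loss.
logits = tfv1.constant(np.random.randn(2, 1, 5).astype(np.float32))
labels = tfv1.SparseTensor(indices=[[0, 0], [0, 1], [0, 2]],
                           values=[1, 2, 3], dense_shape=[1, 3])
loss = tfv1.nn.ctc_loss(labels=labels, inputs=logits, sequence_length=[2],
                        ignore_longer_outputs_than_inputs=True)

with tfv1.Session() as sess:
    print(sess.run(loss))  # [0.] - the too-long sample is skipped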

Glad to see that training is working now. Here are some ideas:

  1. Do a test with data from the training set. You know that data works, so it has to be something else.

  2. Run just the test set from the checkpoint, so you don't have to train again.

  3. It happens during beam search, so maybe the scorer. Use current master or a release - I don't know what commit that Dockerfile checks out.

  4. Check how you built the scorer. The input for that has to be correct UTF-8 as well. Your Dockerfile doesn't show how you do that.

I already did this, but I got the same issue…

I don't understand - could you reiterate?

Currently I am not using a scorer (not sure if it uses a default scorer). The Dockerfile pulls from current master.

Currently not using a scorer.

PS

Is it OK if I build the scorer after some training, or will I need to redo the training once I start using a scorer?

Also, the loss is 270 - so maybe DeepSpeech is predicting some undefined characters which cannot be decoded as UTF-8.

Maybe I should retry after I get the loss under 100, to get valid results?

  1. A high loss after 2 epochs is not concerning; you have to look at both losses over time.

  2. The error will persist even with a loss of 10, as it has to do with how the files are read. Check the points mentioned above.

OK, I didn't read this at first. As we don't have the command you use, it is hard to tell. How do you start testing?

How do you test without training?

Please read the docs carefully and understand how DeepSpeech works. Currently you don’t. This will probably lead to bad results and a bad model …

python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/

So when I run this command, it starts testing.

My command for training is:

python -u DeepSpeech.py --train_files /home/anon/Downloads/jaSTTDatasets/final-train.csv --train_batch_size 4 --dev_files /home/anon/Downloads/jaSTTDatasets/final-dev.csv --dev_batch_size 4 --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint

You have to understand how deep learning works. Please start reading somewhere, maybe in Japanese, about what happens in machine learning. DeepSpeech is currently not yet an end-user product. You won't get results if you continue without any knowledge of what happens.

You are asking about high loss due to a UTF-8 error. You use a batch size of 4 for a single line. You set epochs to 5 for a single test run.

And again, 70 hours of input is not much; you won't get results anywhere near what Google, … do.

Update on this -
It seems these folks were having the same issue - Training Traditional Chinese for Common Voice using Deep Speech

I used their 'ignore' solution and added a few debug statements in /usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py:

def Decode(self, input):
    '''Decode a sequence of labels into a string.'''
    res = super(UTF8Alphabet, self).Decode(input)
    # Debug output: show the raw bytes before decoding.
    print("utf8 Decode function called")
    print(res)
    # errors='ignore' silently drops invalid byte sequences instead of
    # raising UnicodeDecodeError.
    return res.decode('utf-8', 'ignore')
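As a quick illustration (my own example bytes, modeled on the log output below) of what errors='ignore' does - truncated multi-byte sequences are dropped and valid characters survive:

bad = b'\xe3\x81' + b'\xe3\x81\x8b\xe3\x80\x82'  # truncated lead-byte pair, then か。
print(bad.decode('utf-8', 'ignore'))  # -> か。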

My test CSV has only one record.

When I test, the following logs get printed -

root@6e061f9543ba:/DeepSpeech# python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 1 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/
I Loading best validating checkpoint from /home/anon/Downloads/jaSTTDatasets/checkpoint/best_dev-13477
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                       utf8 Decode function called
b'\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\x8b\xe3\x80\x82'
utf8 Decode function called
b'\xe3\x81\x93\xe3\x81\xae\xe6\x96\x99\xe7\x90\x86\xe3\x81\xaf\xe5\x8d\xb5\xe3\x82\x92\xe4\xba\x8c\xe5\x80\x8b\xe4\xbd\xbf\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'
Test epoch | Steps: 1 | Elapsed Time: 0:00:21                                                       
Test on /home/anon/Downloads/jaSTTDatasets/final-test.csv - WER: 1.000000, CER: 0.928571, loss: 116.681183
--------------------------------------------------------------------------------
Best WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------
Median WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------
Worst WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------

When you decode the hex from the first call using https://mothereff.in/utf-8, it is invalid; however, the hex from the second call decodes to この料理は卵を二個使います。, which matches the transcript in my CSV.

I am assuming the function is called once to decode the output predicted by the model and a second time to decode the transcript from the CSV - which supports my hypothesis.

I am happy that you put in the time to solve this. Only you have all the information, and if you give us only bits and pieces it is hard to suggest solutions.

If you have the time, put in a PR that solves this issue for future Japanese/Chinese models.

I am not sure this is the best solution, since I am not a Python programmer. The same Decode function is used in a lot of places, and in some of those places we may not want to 'ignore'. Also, I am not sure how the Python package manager works - this code is from the ctcdecoder package, not from DeepSpeech itself.

Basically, I am not sure I would be the best person to make the change.

I logged an issue here - https://github.com/mozilla/DeepSpeech/issues/3477

I have started trying to integrate augmentation into the training; however, I seem to be stuck on the 'overlay' augmentation.

I could not find the CSV/SDB format for overlay augmentation anywhere in the documentation or on Discourse. Could you please share the format?

Also, the releases page - https://github.com/mozilla/DeepSpeech/releases/tag/v0.9.3 - states that for the overlay augmentation, the background noise dataset is from freesound.org and the voices dataset is from LibriVox. Could you please provide updated links for those datasets, since I wish to use them as well?

The releases page currently shares the hyperparameters for the English model. Could you please share the hyperparameters for the Chinese model as well?

Could you also clarify how much training data I would require to get around 70-90% accuracy? I have managed to get my hands on 8000 h of poor-to-moderate quality data. However, I want to use the minimum amount (maybe 1000 h), since it would take a long time to train for little gain (which at the moment is not important to me).

Found the format in ./bin/run-tc-sample_augmentations.sh

Using https://github.com/karolpiczak/ESC-50 for noise and the dev sets from http://www.openslr.org/12 for voice overlay, with the CSV format above. Please correct me if there are better options!
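For anyone else looking, a rough sketch of how the overlay source plugs in (the paths are made up; check bin/run-tc-sample_augmentations.sh and the training docs in your checkout for the authoritative syntax). The noise collection is an ordinary sample CSV whose transcripts can be left empty, and it is referenced from the --augment flag:

# noise.csv - a normal sample collection:
#   wav_filename,wav_filesize,transcript
#   /data/noise/rain.wav,480044,

python -u DeepSpeech.py \
    --train_files /home/anon/Downloads/jaSTTDatasets/final-train.csv \
    --augment "overlay[p=0.5,source=/data/noise/noise.csv,layers=1,snr=50:20~10]" \
    ...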

For some of the other data there was a special agreement to use it but not to share it publicly. I guess they would share if they could.

This release is still experimental and @reuben will share that info eventually. But don’t count on a fast answer. It’s almost Christmas here in Europe.

For what? Get 1000 hours, compare to a 750-hour model, and you'll know. Generally you'll want lots of good training data. What else would you need noise augmentation for, if your data is already noisy?

Did you succeed in training the Japanese model?
Could you give us some step-by-step instructions?

Thanks in advance.