Help with Japanese model

Reduced batch size to 8; the current epoch has been running for 45 minutes. I will update this if it fails.

I am not familiar with ‘chunk’ in this context. My audio files currently range from 5 to 60 seconds. Do you want me to reprocess the data so that all files are of a similar length?
By merge, do you mean merging multiple audio files so they are around 60 seconds long and therefore all of similar length?

Ah, this will be the cause of some problems. Ideally chunks/audio segments/wavs all have almost the same length, somewhere in the range of 4-8 or 10-15 seconds. I would recommend 5-10 seconds.
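
Something like this would do a naive fixed-length split (just a sketch using pydub, which is not part of DeepSpeech; the paths and the 8-second target are placeholders, and you would still have to cut the transcripts to match, e.g. by splitting on silence at sentence boundaries rather than fixed windows):

from pydub import AudioSegment  # assumed extra dependency (also needs ffmpeg)

def rechunk(wav_path, out_dir, target_ms=8000):
    """Cut one long WAV into consecutive pieces of roughly target_ms milliseconds."""
    audio = AudioSegment.from_wav(wav_path)
    for i, start in enumerate(range(0, len(audio), target_ms)):
        piece = audio[start:start + target_ms]  # pydub slices by milliseconds
        piece.export(f"{out_dir}/chunk_{i:04d}.wav", format="wav")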

OK, I will update this post with my findings after I normalize the audio lengths.

1 Like

I tried a batch size of 8 and it still fails - it also fails at a similar point when using 16 and 24, near the end of the epoch.

Epoch 0 |   Training | Elapsed Time: 1:50:51 | Steps: 3471 | Loss: inf                             E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/18311.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/14902.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/13702.wav
Epoch 0 |   Training | Elapsed Time: 1:52:00 | Steps: 3482 | Loss: inf                             Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node tower_0/dropout_3/GreaterEqual}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[concat/concat/_119]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node tower_0/dropout_3/GreaterEqual}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 607, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 572, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tower_0/dropout_3/GreaterEqual (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[concat/concat/_119]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tower_0/dropout_3/GreaterEqual (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/dropout_3/GreaterEqual':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 484, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 317, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 244, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 204, in create_model
    layers['layer_5'] = layer_5 = dense('layer_5', output, Config.n_hidden_5, dropout_rate=dropout[5], layer_norm=FLAGS.layer_norm)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 93, in dense
    output = tf.nn.dropout(output, rate=dropout_rate)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 4229, in dropout
    return dropout_v2(x, rate, noise_shape=noise_shape, seed=seed, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 4313, in dropout_v2
    keep_mask = random_tensor >= rate
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 4481, in greater_equal
    "GreaterEqual", x=x, y=y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

I have excluded files that are greater than 2 MB - it shouldn't be possible for 8 x 2 MB = 16 MB to cause a 4 GB GPU to go out of memory; correct me if there is some behaviour I am unaware of. Most files are around 250 KB.
The fact that it OOMs towards the end is suspicious of some kind of memory leak.
Will retry with a batch size of 4 …

Try to run the files in reverse; there is a flag option for that. If the error then occurs right at the start, it is caused by a specific file.

Batch size might not be the cause. DeepSpeech, like most ML systems, uses the same feature size for all inputs, so the largest file determines the memory usage. Try to exclude the larger files.
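
As a rough sketch (the 500 KB cutoff is arbitrary, roughly 15 seconds of 16 kHz/16-bit mono audio, and wav_filename, wav_filesize, transcript are the standard DeepSpeech CSV columns), something like this would drop the longest clips from a CSV:

import csv

MAX_BYTES = 500_000  # ~15 s of 16 kHz / 16-bit mono; adjust for your GPU

with open("final-train.csv", newline="", encoding="utf-8") as src, \
     open("final-train-short.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if int(row["wav_filesize"]) <= MAX_BYTES:  # keep only the shorter clips
            writer.writerow(row)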

Using the --reverse_train flag immediately causes the program to go OOM. Apparently DeepSpeech sorts the files so that the largest files are at the bottom - https://github.com/mozilla/DeepSpeech/issues/2513.
Anything above a batch size of 4 crashes.
I think the --reverse_train flag would be a useful tip for topics such as “What is the ideal batch size?”

1 Like

Try a train batch size of 1; it will take longer, but with reverse order you'll see whether it would work at all.

It's working now with a batch size of 4 in reverse order and it hasn't crashed - so I am pretty confident it will complete 1 epoch this time.

Also, I just wanted to confirm that I have converted the transcripts to UTF-8 like this

ただまごまごするだけであった。夫人はそれを見澄してこういった。「誤解しちゃいけませんよ。私は私、

Into

\xE3\x81\x9F\xE3\x81\xA0\xE3\x81\xBE\xE3\x81\x94\xE3\x81\xBE\xE3\x81\x94\xE3\x81\x99\xE3\x82\x8B\xE3\x81\xA0\xE3\x81\x91\xE3\x81\xA7\xE3\x81\x82\xE3\x81\xA3\xE3\x81\x9F\xE3\x80\x82\xE5\xA4\xAB\xE4\xBA\xBA\xE3\x81\xAF\xE3\x81\x9D\xE3\x82\x8C\xE3\x82\x92\xE8\xA6\x8B\xE6\xBE\x84\xE3\x81\x97\xE3\x81\xA6\xE3\x81\x93\xE3\x81\x86\xE3\x81\x84\xE3\x81\xA3\xE3\x81\x9F\xE3\x80\x82\xE3\x80\x8C\xE8\xAA\xA4\xE8\xA7\xA3\xE3\x81\x97\xE3\x81\xA1\xE3\x82\x83\xE3\x81\x84\xE3\x81\x91\xE3\x81\xBE\xE3\x81\x9B\xE3\x82\x93\xE3\x82\x88\xE3\x80\x82\xE7\xA7\x81\xE3\x81\xAF\xE7\xA7\x81\xE3\x80\x81

So each record in the CSV file looks like this -

/home/anon/Downloads/jaSTTDatasets/processedAudio/19752.wav,100070,\xE7\xB4\xA0\xE6\x99\xB4\xE3\x82\x89\xE3\x81\x97\xE3\x81\x84\xE8\xAA\x95\xE7\x94\x9F\xE6\x97\xA5\xE3\x82\x92\xE8\xBF\x8E\xE3\x81\x88\xE3\x82\x89\xE3\x82\x8C\xE3\x81\xBE\xE3\x81\x99\xE3\x82\x88\xE3\x81\x86\xE3\x81\xAB\xE3\x80\x82

Is this correct?

It needs to stay the way it was - the actual UTF-8 encoded characters (ただまごまご…), not the escaped hex notation.

1 Like

Update:

  1. Files causing inf loss -
    This issue was fixed after converting my transcripts from hex escape notation back to properly UTF-8 encoded characters (see the sketch below).
  2. OOM errors -
    Fixed by reducing the batch size to something my GPU could handle, in my case 4.
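
The conversion was along these lines (just a sketch, not my exact script; it turns a transcript stored as literal \xE3\x81... escape text back into real UTF-8 characters):

def unescape_transcript(text):
    # "\xE3\x81\x9F" written out as escape text -> bytes 0xE3 0x81 0x9F -> "た"
    return text.encode("ascii").decode("unicode_escape").encode("latin-1").decode("utf-8")

print(unescape_transcript(r"\xE3\x81\x9F\xE3\x81\xA0"))  # -> ただ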

I was able to get it to train for 2 epochs successfully; however, I have run into another issue I have been stuck on.

I wanted to test the model after 2 epochs, but when the tests are run it returns this error -

I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                      Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 958, in main
    test()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 682, in test
    samples = evaluate(FLAGS.test_files.split(','), create_model)
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 132, in evaluate
    samples.extend(run_test(init_op, dataset=csv))
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 114, in run_test
    cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in ctc_beam_search_decoder_batch
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in <listcomp>
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 227, in <listcomp>
    [(res.confidence, alphabet.Decode(res.tokens)) for res in beam_results]
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 138, in Decode
    return res.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

Naturally I assumed it's a UTF-8 issue and that I need to fix my files. However, I have tried almost everything to fix the files and nothing seems to work. Note that it works for training and validation - it only fails during testing.

anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/final-dev.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/final-train.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/final-test.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ isutf8 ./Downloads/jaSTTDatasets/newutf.csv 
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/newutf.csv -o /dev/null; echo $?
0
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/final-test.csv -o /dev/null; echo $?
0
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/final-train.csv -o /dev/null; echo $?
0
anon@anon-Lenovo-Legion-Y540-15IRH-PG0:~$ iconv -f UTF-8 ./Downloads/jaSTTDatasets/final-dev.csv -o /dev/null; echo $?
0

As you can see, the files seem to be properly UTF-8 encoded.

Here is the test csv file - final-test.zip (349 Bytes)

Here is my Dockerfile, in case you're curious about the environment I am building in -

# Please refer to the TRAINING documentation, "Basic Dockerfile for training"

FROM tensorflow/tensorflow:1.15.4-gpu-py3
ENV DEBIAN_FRONTEND=noninteractive

ENV DEEPSPEECH_REPO=https://github.com/mozilla/DeepSpeech.git
ENV DEEPSPEECH_SHA=origin/master

RUN apt-get update && apt-get install -y --no-install-recommends \
        apt-utils \
        bash-completion \
        build-essential \
        cmake \
        curl \
        git \
        libboost-all-dev \
        libbz2-dev \
        locales \
        python3-venv \
        unzip \
        wget

# We need to remove it because it's breaking deepspeech install later with
# weird errors about setuptools
RUN apt-get purge -y python3-xdg

# Install dependencies for audio augmentation
RUN apt-get install -y --no-install-recommends libopus0 libsndfile1

# Try and free some space
RUN rm -rf /var/lib/apt/lists/*

WORKDIR /
RUN git clone $DEEPSPEECH_REPO DeepSpeech

WORKDIR /DeepSpeech
RUN git checkout $DEEPSPEECH_SHA

# Build CTC decoder first, to avoid clashes on incompatible versions upgrades
RUN cd native_client/ctcdecode && make NUM_PROCESSES=$(nproc) bindings
RUN pip3 install --upgrade native_client/ctcdecode/dist/*.whl

# Prepare deps
RUN pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0

# Install DeepSpeech
#  - No need for the decoder since we did it earlier
#  - There is already correct TensorFlow GPU installed on the base image,
#    we don't want to break that
RUN DS_NODECODER=y DS_NOTENSORFLOW=y pip3 install --upgrade -e .

# Tool to convert output graph for inference
RUN python3 util/taskcluster.py --source tensorflow --branch r1.15 \
        --artifact convert_graphdef_memmapped_format  --target .

# Build KenLM to generate new scorers
WORKDIR /DeepSpeech/native_client
RUN rm -rf kenlm && \
	git clone https://github.com/kpu/kenlm && \
	cd kenlm && \
	git checkout 87e85e66c99ceff1fab2500a7c60c01da7315eec && \
	mkdir -p build && \
	cd build && \
	cmake .. && \
	make -j $(nproc)
WORKDIR /DeepSpeech

ENV TF_FORCE_GPU_ALLOW_GROWTH=true

RUN apt-get update
RUN apt-get install vim -y

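# Patch the ctc_loss calls so that samples whose transcript is longer than the audio allows are skipped instead of raising an error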
RUN sed -i 's/tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len)/tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len, ignore_longer_outputs_than_inputs=True)/g' training/deepspeech_training/train.py
RUN sed -i 's/sequence_length=batch_x_len)/sequence_length=batch_x_len, ignore_longer_outputs_than_inputs=True)/g' training/deepspeech_training/evaluate.py

Glad to see that training is working now. Here are some ideas:

  1. Do a test run with data from your training set. You know that data works for training, so if testing still fails it has to be something else.

  2. Run just the test set from the checkpoint so you don't have to train again.

  3. It happens during beam search, so maybe it's the scorer. Use current master or a release; I don't know what commit that is.

  4. Check how you built the scorer. The input for that has to be correct UTF-8 as well. Your Dockerfile doesn't show how you do that.

I already did this, but I got the same issue…

I don't understand, could you reiterate?

Currently I am not using a scorer (not sure if it uses a default scorer). The Dockerfile pulls from current master.

Currently not using a scorer.

PS

Is it OK if I build the scorer after some training, or will I have to do the training all over again once I start using a scorer?

Also, the loss is 270 - so maybe DeepSpeech is predicting some undefined characters which cannot be decoded as UTF-8.

Maybe I should retry after I get a loss under 100 to get valid results?

  1. A high loss after 2 epochs is not concerning; you have to look at both training and validation loss over time.

  2. The error will persist even with a loss of 10, as it has to do with how the files are read. Check the points mentioned above.

OK, I didn't read this at first. As we don't have the command you use, it is hard to tell. How do you start testing?

How do you test without training?

Please read the docs carefully and understand how DeepSpeech works. Currently you don’t. This will probably lead to bad results and a bad model …

python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/

So when I run this command, it starts testing.

My command for training is

python -u DeepSpeech.py --train_files /home/anon/Downloads/jaSTTDatasets/final-train.csv --train_batch_size 4 --dev_files /home/anon/Downloads/jaSTTDatasets/final-dev.csv --dev_batch_size 4 --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint

You have to understand how deep learning works. Please start reading up somewhere, maybe in Japanese, on what happens in machine learning. DeepSpeech is currently not yet an end-user product. You won't get good results if you continue without any knowledge of what is happening.

You are asking about a high loss due to a UTF-8 error. You use a batch size of 4 for a single line. You set epochs to 5 for a single test run.

And again, 70 hours of input is not much; you won't get results anywhere near what Google, … do.

Update on this -
Seems like these guys were having the same issue - Training Traditional Chinese for Common Voice using Deep Speech

I used their ‘ignore’ solution and added a few debug statements in /usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py -

def Decode(self, input):
    '''Decode a sequence of labels into a string.'''
    res = super(UTF8Alphabet, self).Decode(input)
    # Debug output: show the raw byte string before UTF-8 decoding
    print("utf8 Decode function called")
    print(res)
    # 'ignore' drops invalid byte sequences instead of raising UnicodeDecodeError
    return res.decode('utf-8', 'ignore')

My test CSV has only 1 record.

When I test, the following logs get printed -

root@6e061f9543ba:/DeepSpeech# python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 1 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/
I Loading best validating checkpoint from /home/anon/Downloads/jaSTTDatasets/checkpoint/best_dev-13477
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                       utf8 Decode function called
b'\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\xe3\x81\x8b\xe3\x80\x82'
utf8 Decode function called
b'\xe3\x81\x93\xe3\x81\xae\xe6\x96\x99\xe7\x90\x86\xe3\x81\xaf\xe5\x8d\xb5\xe3\x82\x92\xe4\xba\x8c\xe5\x80\x8b\xe4\xbd\xbf\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'
Test epoch | Steps: 1 | Elapsed Time: 0:00:21                                                       
Test on /home/anon/Downloads/jaSTTDatasets/final-test.csv - WER: 1.000000, CER: 0.928571, loss: 116.681183
--------------------------------------------------------------------------------
Best WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------
Median WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------
Worst WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.928571, loss: 116.681183
 - wav: file:///home/anon/Downloads/jaSTTDatasets/processedAudio/1254.wav
 - src: "この料理は卵を二個使います。"
 - res: "か。"
--------------------------------------------------------------------------------

When you decode the hex from the first call using https://mothereff.in/utf-8 it's invalid; however, when you decode the second call it comes out as この料理は卵を二個使います。, which matches the transcript in my CSV.

I am assuming the function is called once to decode the output predicted by the model and a second time to decode the transcript in the CSV - this supports my hypothesis that the model output itself is what cannot be decoded.
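
To illustrate (a shortened version of the bytes above): the model output keeps starting multi-byte characters without finishing them, so strict decoding fails while ‘ignore’ drops the broken sequences:

# Shortened form of the raw output from the first Decode call above:
# each 0xE3 0x81 pair starts a 3-byte character but the third byte is missing.
bad = b'\xe3\x81' * 3 + b'\xe3\x81\x8b\xe3\x80\x82'
# bad.decode('utf-8')                 # UnicodeDecodeError: invalid continuation byte
print(bad.decode('utf-8', 'ignore'))  # -> か。 (matches the "res" in the test output)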

1 Like

I am happy that you put in the time to solve this. Only you have all the information, and if you give us only bits and pieces, it is hard to suggest solutions.

If you have the time, put in a PR that solves this issue for future Japanese/Chinese models.