Following the "What and how to report" guidelines:
- Mozilla STT version: DeepSpeech 0.8.2
- OS: Linux Mint 19.1 Tessa
- Python 3.6.9
- Tensorflow 1.15.2
- not using GPU
Hello! I've been looking into DeepSpeech (DS) for a while.
I’ve installed DeepSpeech with pip.
pip3 install deepspeech
and downloaded both the pre-trained model and the scorer from the latest release (v0.8.2).
I ran inference with
deepspeech --model /my/path/to/deepspeech-0.8.2-models.pbmm --scorer /my/path/to/deepspeech-0.8.2-models.scorer --audio /my/path/to/myaudio.wav
and got the following results:
original : "in the first three forms we copy the tree to the entries"
inference : "in the first three fortunes we copied the three to the empress"
original : "if you already downloaded and used the tools for something else"
inference : "if you are a detonation he the tools for something else"
original : "you are happily working on something and find the changes in these files are in good order"
inference : "you are happily working on something and find the changes in this files are in good order"
original : "this is most often done when you remembered what you just committed is incomplete"
inference : "this is most often done when you remembered what you just committed to is incomplete"
In the recordings above, I was trying my best to get good results by mimicking a US accent.
(Before that, I had tested DS with my normal Greek accent, with worse results.)
I wasn't exactly satisfied with my results on accented speech, so I looked around for a solution.
I have read the documentation and focused on the “Fine-tuning” section.
I am wondering if it's possible to fine-tune a model with limited data from a single speaker.
And by that I mean:
- gather transcribed audio from a single person, preferably small in size (around 2-5 minutes of speech)
- fine-tune a model (from the released checkpoint) in a "small" amount of time (hopefully less than 30 minutes)
Since I don't have experience in speech recognition or in training neural networks,
I will try to document every step so the experts can point out my mistakes.
I've followed the documentation steps:
git clone https://github.com/mozilla/DeepSpeech
python3 -m venv $HOME/tmp/deepspeech-train-venv/
source $HOME/tmp/deepspeech-train-venv/bin/activate
cd DeepSpeech
pip3 install --upgrade pip==20.0.2 wheel==0.34.2 setuptools==46.1.3
pip3 install --upgrade -e .
- I have python3-dev installed
- don't have CUDA (so I used --load_cudnn later), so my times are acceptably bad
- skipped the Dockerfile step (make Dockerfile.train)
- downloaded the checkpoint, the pre-trained model, and the scorer from the latest release (v0.8.2)
- prepared data:
I've seen some posts suggesting you split your corpus into a 7:2:1 or 8:1:1 ratio for train : dev : test respectively, and some posts (as well as the release documentation) use the same set both for validation (dev) and for testing (test). So I did the same (the csv and audio files are described below), with two directories, /train and /dev; the *.wav file paths in the csv files are relative. I used wc -c to get the audio file sizes in bytes and used them in the csv files (a sketch of how such a csv can be generated is shown below). I wondered if I should pass something in the --read_buffer hyperparameter, but it seemed to work fine as is.
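Here is a minimal sketch of how such a csv can be generated with Python instead of running wc -c by hand; the write_manifest helper is my own, not part of DeepSpeech:

import csv
import os

def write_manifest(wav_dir, utterances, csv_path):
    # utterances: list of (wav_filename, transcript) pairs, paths relative to wav_dir
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_name, text in utterances:
            # os.path.getsize returns the size in bytes, the same number wc -c prints
            size = os.path.getsize(os.path.join(wav_dir, wav_name))
            writer.writerow([wav_name, size, text])

write_manifest("train",
               [("utterance_0.wav", "author of the danger trail philip steels etc")],
               "train/train.csv")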
- ran the fine-tuning:
python3 DeepSpeech.py --load_cudnn --n_hidden 2048 --checkpoint_dir /my/path/to/deepspeech-0.8.2-checkpoint/ --epochs 3 --train_files /my/path/to/train/train.csv --dev_files /my/path/to/dev/dev.csv --test_files /my/path/to/dev/dev.csv --learning_rate 0.0001 --export_dir /my/path/to/deepspeech-0.8.2-myname --scorer /my/path/to/deepspeech-0.8.2-models.scorer
I included a file with the output of the fine-tuning (the command above),
but here are the results (paths are hidden):
Epoch 0 | Training | Elapsed Time: 0:10:18 | Steps: 20 | Loss: 63.334317
Epoch 0 | Validation | Elapsed Time: 0:00:11 | Steps: 5 | Loss: 63.075968 | Dataset: /my/path/to/dev/dev.csv
I Saved new best validating model with loss 63.075968 to: /my/path/to/deepspeech-0.8.2-checkpoint/best_dev-732562
--------------------------------------------------------------------------------
Epoch 1 | Training | Elapsed Time: 0:09:55 | Steps: 20 | Loss: 72.434603
Epoch 1 | Validation | Elapsed Time: 0:00:11 | Steps: 5 | Loss: 89.295271 | Dataset: /my/path/to/dev/dev.csv
--------------------------------------------------------------------------------
Epoch 2 | Training | Elapsed Time: 0:10:05 | Steps: 20 | Loss: 42.426105
Epoch 2 | Validation | Elapsed Time: 0:00:12 | Steps: 5 | Loss: 69.895619 | Dataset: /my/path/to/dev/dev.csv
--------------------------------------------------------------------------------
I FINISHED optimization in 0:31:04.132895
I Loading best validating checkpoint from /my/path/to/deepspeech-0.8.2-checkpoint/best_dev-732562
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /my/path/to/dev/dev.csv
Test epoch | Steps: 5 | Elapsed Time: 0:00:12
Test on /my/path/to/dev/dev.csv - WER: 0.611111, CER: 0.375887, loss: 63.075970
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 0.400000, CER: 0.291667, loss: 24.661560
- wav: file:///my/path/to//dev/utterance_0.wav
- src: "training a model is easy"
- res: "train a model is these"
--------------------------------------------------------------------------------
WER: 0.500000, CER: 0.451613, loss: 31.155817
- wav: file:///my/path/to/dev/utterance_2.wav
- src: "be quiet and only report errors"
- res: "be quiet and alligator"
--------------------------------------------------------------------------------
WER: 0.583333, CER: 0.392857, loss: 56.349792
- wav: file:///my/path/to/dev/utterance_1.wav
- src: "in the first three forms we copy the tree to the entries"
- res: "in the first three fours we go petitoire"
--------------------------------------------------------------------------------
WER: 0.647059, CER: 0.400000, loss: 121.987396
- wav: file:///my/path/to/dev/utterance_3.wav
- src: "you are happily working on something and find the changes in these files are in good order"
- res: "you are a believer in enforcing and find the chances in this is a bamboo"
--------------------------------------------------------------------------------
WER: 0.714286, CER: 0.333333, loss: 81.225273
- wav: file:///my/path/to/dev/utterance_4.wav
- src: "this is most often done when you remembered what you just committed is incomplete"
- res: "this is not an den i omeme what you just come he is in complete"
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 0.400000, CER: 0.291667, loss: 24.661560
- wav: file:///my/path/to/dev/utterance_0.wav
- src: "training a model is easy"
- res: "train a model is these"
--------------------------------------------------------------------------------
WER: 0.500000, CER: 0.451613, loss: 31.155817
- wav: file:///my/path/to/dev/utterance_2.wav
- src: "be quiet and only report errors"
- res: "be quiet and alligator"
--------------------------------------------------------------------------------
WER: 0.583333, CER: 0.392857, loss: 56.349792
- wav: file:///my/path/to/dev/utterance_1.wav
- src: "in the first three forms we copy the tree to the entries"
- res: "in the first three fours we go petitoire"
--------------------------------------------------------------------------------
WER: 0.647059, CER: 0.400000, loss: 121.987396
- wav: file:///my/path/to/dev/utterance_3.wav
- src: "you are happily working on something and find the changes in these files are in good order"
- res: "you are a believer in enforcing and find the chances in this is a bamboo"
--------------------------------------------------------------------------------
WER: 0.714286, CER: 0.333333, loss: 81.225273
- wav: file:///my/path/to/dev/utterance_4.wav
- src: "this is most often done when you remembered what you just committed is incomplete"
- res: "this is not an den i omeme what you just come he is in complete"
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 0.400000, CER: 0.291667, loss: 24.661560
- wav: file:///my/path/to/dev/utterance_0.wav
- src: "training a model is easy"
- res: "train a model is these"
--------------------------------------------------------------------------------
WER: 0.500000, CER: 0.451613, loss: 31.155817
- wav: file:///my/path/to/dev/utterance_2.wav
- src: "be quiet and only report errors"
- res: "be quiet and alligator"
--------------------------------------------------------------------------------
WER: 0.583333, CER: 0.392857, loss: 56.349792
- wav: file:///my/path/to/dev/utterance_1.wav
- src: "in the first three forms we copy the tree to the entries"
- res: "in the first three fours we go petitoire"
--------------------------------------------------------------------------------
WER: 0.647059, CER: 0.400000, loss: 121.987396
- wav: file:///my/path/to/dev/utterance_3.wav
- src: "you are happily working on something and find the changes in these files are in good order"
- res: "you are a believer in enforcing and find the chances in this is a bamboo"
--------------------------------------------------------------------------------
WER: 0.714286, CER: 0.333333, loss: 81.225273
- wav: file:///my/path/to/dev/utterance_4.wav
- src: "this is most often done when you remembered what you just committed is incomplete"
- res: "this is not an den i omeme what you just come he is in complete"
--------------------------------------------------------------------------------
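A quick note on the numbers above, in case someone wants to check my reading of them: WER is the word-level edit distance between the transcript and the reference, divided by the number of words in the reference. For utterance_0, "train a model is these" vs "training a model is easy" has 2 substitutions over 5 words, i.e. WER 0.4, matching the report. A minimal sketch of the computation (my own, not DeepSpeech's evaluation code):

def wer(reference, hypothesis):
    # word-level Levenshtein distance divided by the reference length
    r, h = reference.split(), hypothesis.split()
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(r)

print(wer("training a model is easy", "train a model is these"))  # 0.4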
I searched around for a similar question, or documentation on training with small amounts of data, but didn't find any.
So here are my questions:
- did I miss a step somewhere?
- is it possible to fine-tune a model with a small amount of data and time?
If so, is there a lower threshold for the training set size, or any documentation/paper on this I can read?
- is there any correlation between the amount of training data and the number of epochs? Which --epochs value would be suggested?
- is my accent too heavy and/or not included in the training set? (I have a Greek accent)
- can the checkpoint (v0.8.2) be considered equivalent to the pre-trained model (v0.8.2)?
After testing them both I got the same results, but I wanted to make sure:
- For the inference (pre-trained model) I used:
deepspeech --model /my/path/to/deepspeech-0.8.2-models.pbmm --scorer /my/path/to/deepspeech-0.8.2-models.scorer --audio /my/path/to/dev/utterance_0.wav
(run on all the audio files _0 to _4) I got:
original : "training a model is easy" inference : "training a model is easy" original : "in the first three forms we copy the tree to the entries" inference : "in the first three fortunes we copied the three to the empress" original : "be quiet and only report errors" inference : "be quiet and only report terrors" original : "you are happily working on something and find the changes in these files are in good order" inference : "you are happily working on something and find the changes in this files are in good order" original : "this is most often done when you remembered what you just committed is incomplete" inference : "this is most often done when you remembered what you just committed to is incomplete"
- For the WER (checkpoint model) I used:
python3 DeepSpeech.py --load_cudnn --checkpoint_dir /my/path/to/deepspeech-0.8.2-checkpoint/ --test_files /my/path/to/dev/dev.csv --scorer /my/path/to/deepspeech-0.8.2-models.scorer
(the same audio files are in /dev/dev.csv)
Testing model on /my/path/to/dev/dev.csv
Test epoch | Steps: 5 | Elapsed Time: 0:00:12
Test on /my/path/to/dev/dev.csv - WER: 0.611111, CER: 0.375887, loss: 63.075970
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 1.462615
- wav: file:///my/path/to/dev/utterance_0.wav
- src: "training a model is easy"
- res: "training a model is easy"
--------------------------------------------------------------------------------
WER: 0.058824, CER: 0.022222, loss: 17.974970
- wav: file:///my/path/to/dev/utterance_3.wav
- src: "you are happily working on something and find the changes in these files are in good order"
- res: "you are happily working on something and find the changes in this files are in good order"
--------------------------------------------------------------------------------
WER: 0.071429, CER: 0.037037, loss: 14.509259
- wav: file:///my/path/to/dev/utterance_4.wav
- src: "this is most often done when you remembered what you just committed is incomplete"
- res: "this is most often done when you remembered what you just committed to is incomplete"
--------------------------------------------------------------------------------
WER: 0.166667, CER: 0.032258, loss: 4.264238
- wav: file:///my/path/to/dev/utterance_2.wav
- src: "be quiet and only report errors"
- res: "be quiet and only report terrors"
--------------------------------------------------------------------------------
WER: 0.333333, CER: 0.214286, loss: 40.977341
- wav: file:///my/path/to/dev/utterance_1.wav
- src: "in the first three forms we copy the tree to the entries"
- res: "in the first three fortunes we copied the three to the empress"
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 1.462615
- wav: file:///my/path/to/dev/utterance_0.wav
- src: "training a model is easy"
- res: "training a model is easy"
--------------------------------------------------------------------------------
WER: 0.058824, CER: 0.022222, loss: 17.974970
- wav: file:///my/path/to/dev/utterance_3.wav
- src: "you are happily working on something and find the changes in these files are in good order"
- res: "you are happily working on something and find the changes in this files are in good order"
--------------------------------------------------------------------------------
WER: 0.071429, CER: 0.037037, loss: 14.509259
- wav: file:///my/path/to/dev/utterance_4.wav
- src: "this is most often done when you remembered what you just committed is incomplete"
- res: "this is most often done when you remembered what you just committed to is incomplete"
--------------------------------------------------------------------------------
WER: 0.166667, CER: 0.032258, loss: 4.264238
- wav: file:///my/path/to/dev/utterance_2.wav
- src: "be quiet and only report errors"
- res: "be quiet and only report terrors"
--------------------------------------------------------------------------------
WER: 0.333333, CER: 0.214286, loss: 40.977341
- wav: file:///my/path/to/dev/utterance_1.wav
- src: "in the first three forms we copy the tree to the entries"
- res: "in the first three fortunes we copied the three to the empress"
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 1.462615
- wav: file:///my/path/to/dev/utterance_0.wav
- src: "training a model is easy"
- res: "training a model is easy"
--------------------------------------------------------------------------------
WER: 0.058824, CER: 0.022222, loss: 17.974970
- wav: file:///my/path/to/dev/utterance_3.wav
- src: "you are happily working on something and find the changes in these files are in good order"
- res: "you are happily working on something and find the changes in this files are in good order"
--------------------------------------------------------------------------------
WER: 0.071429, CER: 0.037037, loss: 14.509259
- wav: file:///my/path/to/dev/utterance_4.wav
- src: "this is most often done when you remembered what you just committed is incomplete"
- res: "this is most often done when you remembered what you just committed to is incomplete"
--------------------------------------------------------------------------------
WER: 0.166667, CER: 0.032258, loss: 4.264238
- wav: file:///my/path/to/dev/utterance_2.wav
- src: "be quiet and only report errors"
- res: "be quiet and only report terrors"
--------------------------------------------------------------------------------
WER: 0.333333, CER: 0.214286, loss: 40.977341
- wav: file:///my/path/to/dev/utterance_1.wav
- src: "in the first three forms we copy the tree to the entries"
- res: "in the first three fortunes we copied the three to the empress"
--------------------------------------------------------------------------------
- does the fine-tuning process modify the checkpoint model?
- in this post there is a suggestion to use some of the Common Voice data set when fine-tuning;
however, it is unclear to me whether that suggestion is retracted at the end. So, rephrasing the post:
is it a good idea to mix some data from the Common Voice data set (or any other data set used in the original training) with the speaker's data when fine-tuning?
(If it is a good idea, I'm guessing I should find data with an accent similar to the speaker's.)
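To make the question concrete, here is a minimal sketch of what I mean by mixing: concatenating a random sample of a Common Voice manifest with the speaker's manifest into one training csv. The file names and the mix_manifests helper are made up for illustration, and it assumes the Common Voice clips were already converted to DeepSpeech's csv schema (e.g. with the importer scripts):

import csv
import random

def mix_manifests(speaker_csv, common_voice_csv, out_csv, cv_samples=100):
    # read both manifests (wav_filename, wav_filesize, transcript)
    with open(speaker_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(common_voice_csv, newline="") as f:
        cv_rows = list(csv.DictReader(f))
    # add a random sample of Common Voice utterances to the speaker's data
    rows.extend(random.sample(cv_rows, min(cv_samples, len(cv_rows))))
    random.shuffle(rows)
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["wav_filename", "wav_filesize", "transcript"],
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

mix_manifests("train/train.csv", "cv/clips.csv", "train/mixed.csv")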
Thank you very much!
Edit: it seems I'm a new user on Discourse, so I cannot upload any files yet…
I'll just describe them and give you a sample (no audio, sadly).
In the train directory I have:
- train.csv
- 20 wav files (utterance_0.wav - utterance_19.wav)
Here is a preview of train.csv:
wav_filename,wav_filesize,transcript
utterance_0.wav,159788,author of the danger trail philip steels etc
utterance_1.wav,196652,not at this particular case tom apologized whittemore
utterance_2.wav,151596,for the twentieth time that evening the two men shook hands
utterance_3.wav,151596,lord but i'm glad to see you again phil
utterance_4.wav,86060,will we ever forget it
utterance_5.wav,159788,god bless 'em i hope i'll go on seeing them forever
utterance_6.wav,155692,and you always want to see it in the superlative degree
utterance_7.wav,114732,gad your letter came just in time
In the dev directory I have:
- dev.csv
- 5 wav files (utterance_0.wav - utterance_4.wav)
Here is the whole dev.csv:
wav_filename,wav_filesize,transcript
utterance_0.wav,98348,training a model is easy
utterance_1.wav,163884,in the first three forms we copy the tree to the entries
utterance_2.wav,131116,be quiet and only report errors
utterance_3.wav,225324,you are happily working on something and find the changes in these files are in good order
utterance_4.wav,225324,this is most often done when you remembered what you just committed is incomplete
(In case you are wondering, the training text is from the CMU Sphinx tutorial, from the CMU Arctic data set.)