Training/fine-tuning DeepSpeech branch/version - 0.7.0 on Linux

A_N · June 20, 2020, 11:51pm

@reuben as a question on this note:

With fine-tuning (and training with a custom training set), would it make sense to just use the LibriSpeech validation set, or would it be better to use a validation set split from the custom dataset?

reuben · June 21, 2020, 6:44am

It only makes sense to use your own data, otherwise you’ll be fine tuning blindly, with no way to track how the model is evolving w.r.t. your data.

A_N · June 27, 2020, 6:35pm

@othiele @reuben
I am training/fine-tuning DeepSpeech 0.7.0 on Ubuntu 16.04 with Python 3.6.5, TensorFlow 1.15.2, and CUDA 10.0/cuDNN 7.6.5.
I have a 2070-hour dataset whose transcripts are about 90-95% accurate (ums, uhs, repeated words, false starts, and stutters are not transcribed, and occasionally the transcript has less or more text than the wav file itself). I initially split this into 2000hr train, 35hr validation, and 35hr test sets, but later re-split it into 2060hr train, 5hr validation, and 5hr test sets, because I manually fixed the transcription on the 5hr validation set and made sure it was 99+% accurate.

My question is twofold:

(1) Are there any suggested automatic ways to fix the missing transcription? I saw DSAlign and other forced-alignment tools, but have not spent the time to get them to work. Is that the right direction here? Or is it better to fix the data manually even if it is slow? What has worked best in your experience?
(2) What is the suggested train/val/test split here? I am assuming my current 5-hour validation set is too small (I chose 5 hours to clean up manually mainly because the released model was trained on 3817 hours with a 5-hour clean validation set). Is 5 hours too small, or is it sufficient? Or would it be better to use 2-3% each for the validation and test sets and the rest for training? I would appreciate your thoughts and suggestions on this.
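
For scale, the 2-3% option mentioned above would mean roughly 41 to 62 hours each for validation and test on a 2070-hour dataset, compared with the current 5 hours. The following is only a minimal sketch of a duration-based split, not a recommendation from this thread. The function names and the `(wav_path, duration_seconds)` tuple layout are assumptions made for illustration and are not part of DeepSpeech.

```python
import random


def hours_for_fraction(total_hours, fraction):
    """Return how many hours a given fraction of the dataset represents."""
    return total_hours * fraction


def split_clips(clips, dev_fraction, test_fraction, seed=42):
    """Split clips into train/dev/test by total audio duration.

    clips: list of (wav_path, duration_seconds) tuples (hypothetical layout).
    The dev and test sets are filled until they reach their share of the
    total duration; everything else goes to train.
    """
    shuffled = list(clips)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split

    total = sum(duration for _, duration in shuffled)
    dev_target = total * dev_fraction
    test_target = total * test_fraction

    train, dev, test = [], [], []
    dev_time = test_time = 0.0
    for clip in shuffled:
        duration = clip[1]
        if dev_time < dev_target:
            dev.append(clip)
            dev_time += duration
        elif test_time < test_target:
            test.append(clip)
            test_time += duration
        else:
            train.append(clip)
    return train, dev, test


# Size of a 2% and a 3% held-out set for the 2070-hour dataset described above.
for fraction in (0.02, 0.03):
    print(f"{fraction:.0%}: {hours_for_fraction(2070, fraction):.1f} hours")
```

Because the split is by duration rather than by clip count, a few long recordings cannot make the held-out sets much larger or smaller than intended.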

thanks

othiele · June 28, 2020, 5:11pm

Please stop hijacking an older post about a rising loss to post a question on the train/dev/test split.
Search your question before asking and you’ll find great answers.

Akmal_Nodirov · June 29, 2020, 6:51am

Hello, I have a question about datasets: I have 10 different words, spoken by 10 different people, with almost 400 recordings of those 10 words in total. How can I properly separate them into train, dev, and test files? Also, I don't know exactly whether I can use data like this, I mean the same words spoken by different people. Thank you.
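
One common way to approach a split like this, which is not something stated in this thread, is to hold out whole speakers so that the dev and test sets contain voices the model never saw during training. The sketch below assumes each recording is a `(wav_path, speaker_id, word)` tuple; that layout, the speaker IDs, and the function name are hypothetical.

```python
def split_by_speaker(recordings, dev_speakers, test_speakers):
    """Assign each recording to train, dev, or test based on its speaker.

    recordings: list of (wav_path, speaker_id, word) tuples (hypothetical layout).
    dev_speakers / test_speakers: sets of speaker IDs to hold out.
    """
    train, dev, test = [], [], []
    for recording in recordings:
        speaker = recording[1]
        if speaker in test_speakers:
            test.append(recording)
        elif speaker in dev_speakers:
            dev.append(recording)
        else:
            train.append(recording)
    return train, dev, test


# Example: 10 speakers x 10 words x 4 repetitions = 400 recordings.
recordings = [
    (f"s{s}_w{w}_r{r}.wav", f"s{s}", f"w{w}")
    for s in range(10)
    for w in range(10)
    for r in range(4)
]
train, dev, test = split_by_speaker(recordings, {"s8"}, {"s9"})
print(len(train), len(dev), len(test))
```

Every word still appears in all three sets, but no speaker is shared between them.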

othiele · June 29, 2020, 6:54am

@A_N, @Akmal_Nodirov please learn about how to post in forums. I just said “don’t hijack old threads” and you do exactly that. Post it in a new thread with a good headline and we can discuss it. And do your research first:

https://discourse.mozilla.org/t/what-and-how-to-report-if-you-need-support/62071/2

Akmal_Nodirov · June 29, 2020, 7:46am

I didn't understand the part about test data. Please clarify your answer: can we take the test data from our training data?

othiele · June 29, 2020, 7:58am

OK:
(1) search for test/dev split in this forum
(2) open a new thread/ticket/whatever you call it with a descriptive title
(3) write down what you have learned while searching
(4) ask for what is still unknown

Do not post further in this thread

Akmal_Nodirov · June 29, 2020, 8:31am

What do you think I am doing now?

tanner · July 10, 2020, 1:58am

I’ve split this into its own topic. Please do not hijack threads with unrelated questions.
