Noob questions on training my own model

Please bear with me, I’m a total noob with ASR. I’m trying to work through the guide on training your own model.

After running import_cv2, it generated a couple of files (dev, other, test, train and validated). Is it OK that their sizes are far smaller than the original tsv files?

The only thing I added to the tutorial’s command is a filter alphabet consisting of the letters a-z, because without it I get errors caused by characters from the German language. Comparing the sizes of the results, did I lose a lot of data? Maybe I used filter_alphabet incorrectly?

From corpora
6M dev.tsv
43M invalidated.tsv
37M other.tsv
264K reported.tsv
3.5M test.tsv
32M train.tsv
273M validated.tsv

Files generated after running import_cv2
81k dev.csv
505K other.csv
789K test.csv
1.9K train-all.csv
1.7K train.csv
1.3M validated.csv

I also noticed that the transcripts in the resulting csv files contain only one word each. Should I include a space character in my alphabet.txt?

my alphabet.txt looks like this

Another question: I would like to train using our own audio files, mostly calls from clients. Do I need to cut the audio files into smaller bits, about the length of a phrase, before I can use them in DeepSpeech?

Also, training requires train, dev and test csv files to be supplied on the command line. What is the difference between those files? I think it was not mentioned in the tutorial, but please correct me if I’m wrong. Will they have the same contents? (I’m also using this as a reference, but it’s too high-level and a bit hard to grasp for a beginner: https://medium.com/visionwizard/train-your-own-speech-recognition-model-in-5-simple-steps-512d5ac348a5)

Any help would be appreciated.

Thanks

I haven’t tried the import for a while. Are umlauts still in there? If so, you would have to solve that first. In earlier versions, each sentence was only used once even if you had different versions of it.

Whether to use umlauts or not is always debatable :slight_smile:

Yes, you need one for German.
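
On the space character and the one-word transcripts, here is a minimal sketch of what is probably happening. It assumes the filter drops every transcript that contains a character outside the alphabet, and that alphabet.txt holds one character per line with `#` lines as comments. The helper names `load_alphabet` and `keep_transcript` are made up for illustration and are not part of the importer.

```python
def load_alphabet(lines):
    """Collect characters from alphabet-style lines (one character per line)."""
    chars = set()
    for line in lines:
        char = line.rstrip("\r\n")  # keep a lone space, only drop the newline
        if char == "" or char.startswith("#"):
            continue  # skip blank lines and comments
        chars.add(char)
    return chars


def keep_transcript(transcript, alphabet):
    """A transcript survives only if every character is in the alphabet."""
    return all(ch in alphabet for ch in transcript)


letters = [c + "\n" for c in "abcdefghijklmnopqrstuvwxyz"]
letters_only = load_alphabet(letters)
with_space = load_alphabet([" \n"] + letters)

samples = ["hello", "hello world", "grüße"]
kept_without_space = [s for s in samples if keep_transcript(s, letters_only)]
kept_with_space = [s for s in samples if keep_transcript(s, with_space)]

print(kept_without_space)  # ['hello']
print(kept_with_space)     # ['hello', 'hello world']
```

Under that assumption, an a-z alphabet without a space line rejects every multi-word sentence, which would explain both the tiny csv files and the one-word transcripts.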

Yes, usually 4-8 seconds.
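
A rough sketch of cutting a long WAV recording into short pieces, using only Python's standard `wave` module. The function name `split_wav` and the 6-second default are assumptions for illustration. Fixed-length cuts can land in the middle of a word, so in practice you would rather cut at pauses, and every piece still needs its own transcript.

```python
import wave


def split_wav(src_path, out_prefix, chunk_seconds=6.0):
    """Write consecutive chunks of at most chunk_seconds; return their paths."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            path = f"{out_prefix}_{index:04d}.wav"
            with wave.open(path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(path)
            index += 1
    return paths
```

Usage would look like `split_wav("call.wav", "call_part")`, where both names are placeholders.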

Please read about Deep Learning in general first. Otherwise you won’t get good models. (@kreid maybe you could include a pointer to a tutorial? Many people don’t understand what train/dev and test are for if they don’t have a ML background. Overfitting, epochs and learning rate are usually the next questions. )
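
In short: the train set is what the model learns from, the dev (validation) set is checked during training to spot overfitting and tune settings, and the test set is held back for the final evaluation. The three files must not share samples. Common Voice already ships with such a split. For your own data, a minimal sketch could look like this, where `split_rows` and the 10% fractions are illustrative assumptions and not DeepSpeech code:

```python
import random


def split_rows(rows, dev_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle rows and cut them into disjoint train, dev and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_dev = int(len(rows) * dev_fraction)
    n_test = int(len(rows) * test_fraction)
    dev = rows[:n_dev]
    test = rows[n_dev:n_dev + n_test]
    train = rows[n_dev + n_test:]
    return train, dev, test


# Placeholder file names standing in for the rows of your own csv.
rows = [f"clip_{i:03d}.wav" for i in range(100)]
train, dev, test = split_rows(rows)
print(len(train), len(dev), len(test))  # 80 10 10
```

With call recordings it is probably worth splitting by caller rather than by clip, so the same voice does not end up in both train and test.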


Hi Olaf,

Thanks for your reply, it answered some things that really confused me. Just to clarify, I’m not trying to train on German; I’m trying to exclude words that contain German characters or characters from any alphabet other than English.

I think I’ll take a deep learning course on Udemy to get a general idea of deep learning.

I actually learned a bit of Kaldi before, and the way I understand dev/train is that they provide the same set of phrases from different speakers. Is this correct? If yes, does that mean that for DeepSpeech I also need to provide the same audio spoken by multiple people of different genders?

Thank you,
Simon

Run language detection on the sentences. Then it is easier to exclude them. Common Voice is already split into the respective languages, so you probably have some other data.

Not really. Whether to take the same sentences from different speakers is debatable. Search for training and validation set to get an idea.

Again, debatable. If you don’t have much data, take what you can get hold of. If you have plenty, have fewer repetitions.


Thank you Olaf, appreciate the help

Hi @othiele, how do I control the number of repetitions? I tried searching for repetition or increment in the docs, but I’m not able to find any references.

We have a non-GPU machine importing/training on the 50 GB corpus; it has been running for a week and is still processing. Is that normal, or do I just have to reduce the repetitions since there are tons of samples?

A repetition is the same sentence appearing twice in the material. I don’t know whether there is a switch for that. Otherwise, scan the csv files for multiple occurrences.
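
A small sketch of such a scan, assuming the csv has a `transcript` column like the files the importer writes. The function name `repeated_transcripts` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def repeated_transcripts(csv_file, column="transcript"):
    """Return {transcript: count} for transcripts that occur more than once."""
    counts = Counter(row[column] for row in csv.DictReader(csv_file))
    return {text: n for text, n in counts.items() if n > 1}


# Made-up rows for the demo; in practice pass open("train.csv", newline="").
demo = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,hello world\n"
    "b.wav,1200,good morning\n"
    "c.wav,1100,hello world\n"
)
repeats = repeated_transcripts(demo)
print(repeats)  # {'hello world': 2}
```

From there you could keep only the first one or two rows per transcript if you want fewer repetitions.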

Training on that amount of data on a CPU is practically impossible. The import should be a bit quicker, though, depending on what your input audio is.


@othiele as always, thank you for your help

I’d be very interested in this. I am working on custom training data and my next questions are ‘beginner’ ML questions. For example, I have a question about training data: many of the phrases contain human filler words like ‘um’, ‘er’, ‘uh’, that kind of thing. Are these samples useless, or OK to use? Should I transcribe the ‘um’ or leave it out of the transcription?