Thanks. Yes, in a test with Audacity the differences were quite recognisable. I will have to look into how to break up an audio file into (say) 10-second slices while ensuring words are not cut off. There are a few posts here on Discourse regarding that.
I did have a quick look at “audiogrep” ( https://github.com/antiboredom/audiogrep ) yesterday, but there was an error preventing me from continuing. It doesn’t look like it has been maintained for a while.
I used Audacity recently to remove some noise from a WAV file. In the audio files that we need to process here, there will be considerable gaps, as the speaker is pausing or waiting. It would take a while to go through the audio manually and remove those gaps. Are there any tools that can process an audio file and remove (say) gaps longer than 5 seconds?
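A minimal sketch of the idea in Python, assuming the audio is already loaded as a list of signed 16-bit samples (for example via the standard wave module, which is left out here). The function name, the amplitude threshold of 500 and the 0.5 s of pause that is kept are all arbitrary assumptions, not taken from any tool mentioned in this thread:

```python
def remove_long_gaps(samples, sample_rate, max_gap_s=5.0, threshold=500, keep_s=0.5):
    """Shorten every quiet run longer than max_gap_s down to keep_s seconds.

    samples: list of signed integer amplitudes (assumed 16-bit mono PCM).
    threshold: amplitudes below this absolute value count as "quiet" (a guess,
    to be tuned against the noise floor of the recording).
    """
    max_gap = int(max_gap_s * sample_rate)
    keep = int(keep_s * sample_rate)
    out = []
    run = []  # current run of consecutive quiet samples

    def flush(run):
        # Short pauses are kept untouched, long ones are trimmed.
        out.extend(run if len(run) <= max_gap else run[:keep])

    for s in samples:
        if abs(s) < threshold:
            run.append(s)
        else:
            flush(run)
            run = []
            out.append(s)
    flush(run)  # handle a quiet run at the very end
    return out
```

A per-sample threshold is crude; averaging the amplitude over short windows would probably be more robust against clicks and background noise.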
Great tutorial and discussion. I am trying to train on 5000 utterances and it is taking a couple of hours per epoch. Can you share what configuration you used and how long each epoch took? Thanks for the help.
Yes of course, but keep in mind that it’s for a robot AI, so it is a simplified one, with some tricks to limit bad inference results.
# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
# FOR FRENCH LIMITED CORPUS - JUST WORKING IN SOUND PERCEPTION - A BOT WILL ANALYSE RESULTS
# The last (non-comment) line needs to end with a newline.
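As a rough illustration of the rules described in that header (this is a sketch, not DeepSpeech's own loader), the file could be read like this in Python. The function name load_alphabet is made up, and skipping empty lines is my own assumption:

```python
def load_alphabet(lines):
    """Return the list of labels; the index of each label is its numeric id."""
    labels = []
    for line in lines:
        text = line.rstrip("\n")  # keep spaces: a line holding one space is a valid label
        if text.startswith("\\#"):
            labels.append("#")    # escaped '#' is a real label
        elif text.startswith("#"):
            continue              # comment line
        elif text == "":
            continue              # assumption: ignore empty lines
        else:
            labels.append(text)
    return labels
```

Usage would be something like load_alphabet(open("alphabet.txt", encoding="utf-8")), after which labels.index("a") gives the numeric label of "a".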
Well, as you can see in the DeepSpeech process,
a WAV is cut in steps of milliseconds.
Each slice of the audio is “linked” to a character of a vocabulary word, and both are sent to the “builder”.
There is a big risk of error in this process, because even a really small gap can result in lots of errors (big gap, character errors…).
So a small WAV file, about 5 s, is the “best” compromise.
You could think: “so I’ll use WAVs of only 1 word, to avoid gaps”.
It’s not a good idea: a word that starts an utterance and the same word following a previous one don’t produce the same waveform (amplitude) at the beginning.
Ex: “hello” vs. “I say hello”.
Often, the beginning of the waveform is higher when the word starts the utterance.
Yes, and I appreciate that your thread here is based on building the WAV files used for training by speaking. However, that is not always the case: sometimes we may want to do the ‘same’ type of building (i.e. build our own models), but the WAV sources all come from one existing WAV file. Hence the need to cut a WAV file into small pieces while trying to keep whole words within each cut.
There is a Python script there, example.py, and I ran it against a 10-minute WAV file. The result was 56 WAV files, with durations ranging from 0.63 seconds to 49.38 seconds.
Then the author of that package advised how to reduce the range of durations, as 49.38 seconds is a long way from your recommendation of 5 seconds max. The result then was 243 WAV files, with durations ranging from 0.18 seconds to 13.44 seconds.
Of course some of the shorter WAVs are just noise, or even silence, at least as far as I could hear. Some 2- or 3-word WAVs are only 2 seconds long, and quite a few contain just 1 word.
Of those 243 WAV files, only 31 exceed your recommendation of 5 seconds though, so that seems encouraging.
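For what it's worth, the general idea of cutting at pauses and then checking which pieces exceed the 5-second recommendation can be sketched in Python as below. This is not the script from that package; the function names, the threshold of 500 and the 0.3 s minimum pause are assumptions of mine, and the audio is assumed to be loaded already as a list of signed 16-bit samples:

```python
def split_on_pauses(samples, sample_rate, min_pause_s=0.3, threshold=500):
    """Return (start, end) sample indexes of the non-quiet segments.

    A segment ends once at least min_pause_s of quiet samples follow it,
    so the cuts fall inside pauses rather than inside words.
    """
    min_pause = int(min_pause_s * sample_rate)
    segments = []
    start = None  # index where the current segment began, or None
    quiet = 0     # length of the current quiet run inside a segment
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause:
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # Close the last segment, dropping any trailing quiet samples.
        segments.append((start, len(samples) - quiet))
    return segments


def segment_durations(segments, sample_rate):
    """Duration in seconds of each (start, end) segment."""
    return [(end - start) / sample_rate for start, end in segments]


def over_limit(durations, max_seconds=5.0):
    """Durations that exceed the recommended maximum clip length."""
    return [d for d in durations if d > max_seconds]
```

A larger min_pause_s gives fewer, longer pieces; a smaller one gives more pieces but starts to split between words that are spoken with only a short pause.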
I get for some (but not all) vocabularies the following error:
vocab.cc:305 in void lm::ngram::MissingSentenceMarker(const lm::ngram::Config&, const char*) threw SpecialWordMissingException.
The ARPA file is missing </s> and the model is configured to reject these models. Run build_binary -s to disable this check. Byte: 106432571
Here is a sample of a typical DeepSpeech CSV file:
/home/nvidia/DeepSpeech/data/alfred/dev/record.1.wav,87404,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/dev/record.2.wav,101804,quel est ton nom ou comment tu t'appelles
/home/nvidia/DeepSpeech/data/alfred/dev/record.3.wav,65324,est-ce que tu vas bien
You must keep the first line, the header wav_filename,wav_filesize,transcript (needed to name the columns of the CSV).
Each following line gives 3 values, separated by commas:
where the WAV file is (I use the full path; perhaps a relative path could work?!),
what its size is in bytes (you can get the size with os.path.getsize("the wav file")),
what the transcript is (in the language of the WAV).
Take a look at …DeepSpeech/bin/import_ldc93s1.py, L23, for CSV creation!
About the transcript: take care to use only characters present in alphabet.txt, otherwise you’ll encounter errors when training.
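A minimal Python sketch of those two points: writing the CSV and checking a transcript against the alphabet. The function names are made up for illustration, and the header wav_filename,wav_filesize,transcript is assumed to be the one DeepSpeech's importers write; this is not a copy of import_ldc93s1.py:

```python
import csv
import os


def write_deepspeech_csv(csv_path, entries):
    """Write one row per (wav_path, transcript) pair, preceded by the header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        # Assumed header: the three column names used by DeepSpeech CSVs.
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in entries:
            writer.writerow([
                os.path.abspath(wav_path),   # full path, as in the sample above
                os.path.getsize(wav_path),   # file size in bytes
                transcript,
            ])


def chars_not_in_alphabet(transcript, alphabet):
    """Return the sorted characters of transcript that are missing from alphabet."""
    return sorted(set(transcript) - set(alphabet))
```

For example, chars_not_in_alphabet("qui es-tu", "abcdefghijklmnopqrstuvwxyz '") returns ['-'], so with that alphabet the hyphen would have to be added to alphabet.txt or removed from the transcript before training.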