I am new to deep learning and am working with DeepSpeech. I would like to know what format the audio and transcriptions need to be in when preparing data.
Split your audio files to sentence length (say 1 - 15 seconds). Then create three files: one for training (train.csv), one for development testing (dev.csv), and one for evaluation testing (test.csv). The file names are arbitrary. The first line of each must contain column declarations, and there must be at least these columns:
wav_filename,wav_filesize,transcript
There can be any number of other columns if you need them. The subsequent lines contain the data in the order defined by the header line.
wav_filename corresponds to the path to the audio relative to the csv file.
wav_filesize is the number of bytes of the audio file.
transcript is the transcript limited to your alphabet.
A sample of a file could be:
wav_filename,wav_filesize,transcript
clips/sentence1.wav,16444,the cat sat on the mat
clips/sentence2.wav,21010,the bat shat on the cat
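If you would rather not write these files by hand, here is a minimal sketch of how one could be generated with Python's standard library. The function name `write_manifest` and the idea of passing (path, transcript) pairs are my own assumptions for illustration, not part of DeepSpeech.

```python
import csv
import os


def write_manifest(csv_path, entries):
    """Write a CSV with the three required columns.

    `entries` is an iterable of (wav_path, transcript) pairs
    (a hypothetical input format chosen for this sketch).
    """
    csv_dir = os.path.dirname(os.path.abspath(csv_path))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header line with the column declarations.
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in entries:
            # Path to the audio relative to the csv file.
            rel_path = os.path.relpath(os.path.abspath(wav_path), csv_dir)
            # Size of the audio file in bytes.
            size = os.path.getsize(wav_path)
            writer.writerow([rel_path, size, transcript])
```

Calling `write_manifest("train.csv", [("clips/sentence1.wav", "the cat sat on the mat")])` should produce a header line followed by one data line per clip. It does not check that the transcript stays within your alphabet, so that still needs to be handled separately.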
The size ratio of train:dev:test is usually 8:1:1, but that’s more of a convention than a generally optimal number.
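A rough sketch of how such a split could be done in Python follows. The function name `split_entries`, the fixed seed, and the shuffling step are assumptions of mine; adjust the ratios to whatever suits your data.

```python
import random


def split_entries(entries, ratios=(8, 1, 1), seed=0):
    """Shuffle the entries and split them into train/dev/test lists."""
    items = list(entries)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_dev = len(items) * ratios[1] // total
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    # Whatever remains after rounding goes to the test set.
    test = items[n_train + n_dev:]
    return train, dev, test
```

Each of the three returned lists can then be written to its own CSV file. Note that this splits by clip only; if the same speaker or sentence appears in several clips, you may want to group those before splitting.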
The audio files must be standard WAV, 16kHz, mono.
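To check your clips before training, something like the sketch below might help. It uses Python's built-in `wave` module, which only reads uncompressed PCM WAV files; the function name `is_16khz_mono` is just an illustrative choice.

```python
import wave


def is_16khz_mono(wav_path):
    """Return True if the file is a readable WAV at 16 kHz with one channel."""
    try:
        with wave.open(wav_path, "rb") as w:
            return w.getframerate() == 16000 and w.getnchannels() == 1
    except (wave.Error, EOFError):
        # The wave module could not parse the file as PCM WAV.
        return False
```

Files that fail the check would need to be converted with an audio tool of your choice before being listed in the CSV files.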
Thank you @sixtease for the detailed explanation. It was helpful.
Love the examples, there should be a generator for that