Can someone explain how to prepare my own data (format of both audio and transcription)?

I am a newbie in deep learning and am working with DeepSpeech. I would like to know the format of the audio and transcriptions for preparing data.


Split your audio files into sentence-length clips (say 1 to 15 seconds). Then create three files, one for training (train.csv), one for development testing (dev.csv), one for evaluation testing (test.csv). The file names are arbitrary. The first line of each must contain column declarations, and there must be at least these columns:

wav_filename,wav_filesize,transcript

There can be any number of other columns if you need them. The subsequent lines contain the data in the order defined by the header line.

  • wav_filename corresponds to the path to the audio relative to the csv file.
  • wav_filesize is the number of bytes of the audio file.
  • transcript is the transcript limited to your alphabet.

Sample data lines (these follow the header line) could be:

clips/sentence1.wav,16444,the cat sat on the mat
clips/sentence2.wav,21010,the bat shat on the cat
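
If you would rather not write these lines by hand, here is a minimal sketch of a generator using only the Python standard library. It assumes you already have a list of (audio path, transcript) pairs; the function name `write_manifest` is made up for this example and is not part of DeepSpeech.

```python
import csv
import os


def write_manifest(csv_path, clips):
    """Write a CSV with the three columns described above.

    clips: iterable of (wav_path, transcript) pairs (hypothetical input
    format); each wav_path must point to an existing file.
    """
    csv_dir = os.path.dirname(os.path.abspath(csv_path))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header line with the column declarations.
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in clips:
            writer.writerow([
                # Path to the audio relative to the CSV file.
                os.path.relpath(os.path.abspath(wav_path), csv_dir),
                # Size of the audio file in bytes.
                os.path.getsize(wav_path),
                transcript,
            ])
```

You would call it once per file, e.g. `write_manifest("train.csv", train_clips)`. It does not check that the transcript stays within your alphabet, so you still have to clean the text yourself.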

The size ratio of train:dev:test is usually 8:1:1, but that’s more of a convention than a generally optimal number.
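
One possible way to get such a split is to shuffle the rows and cut the list in two places. This is only a sketch: `split_rows`, its default ratios and the fixed seed are my own choices for illustration, nothing DeepSpeech prescribes.

```python
import random


def split_rows(rows, ratios=(8, 1, 1), seed=0):
    """Shuffle rows and split them into (train, dev, test) lists."""
    rows = list(rows)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    total = sum(ratios)
    n_train = len(rows) * ratios[0] // total
    n_dev = len(rows) * ratios[1] // total
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    # Whatever is left over after rounding goes to the test set.
    test = rows[n_train + n_dev:]
    return train, dev, test
```

A plain random split like this can put the same speaker in all three sets; whether that matters depends on what you want to measure.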

The audio files must be standard WAV, 16kHz, mono.
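
To catch files that do not match before training, you could inspect the WAV headers with Python's built-in `wave` module. A rough sketch, where `check_wav` is a hypothetical helper name:

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems found in the WAV header (empty if none)."""
    problems = []
    # wave.open raises wave.Error if the file is not an uncompressed PCM WAV.
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

This only reports problems; converting the audio is a separate step with whatever audio tool you prefer.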



Thank you @sixtease for the detailed explanation. It was helpful :slight_smile:

Love the examples, there should be a generator for that :slight_smile:
