Creating an Indian accent model with ~115k files

Okay, as discussed earlier in this thread, I am trying to create a model with 160 hours of Indian accent audio, but while running the model creation code I am facing this error for many files:

    Exception in thread Thread-7:
    Traceback (most recent call last):
      File "/Users/naveen/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/Users/naveen/anaconda3/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "/Users/naveen/Downloads/DeepSpeech/DeepSpeech/util/feeding.py", line 151, in _populate_batch_queue
        raise ValueError('Error: Audio file {} is too short for transcription.'.format(wav_file))
    ValueError: Error: Audio file /Users/naveen/Downloads/all_datasets/DeepSpeech/TEST/g0907_e_tam_f_output.wav is too short for transcription.

No, the context I meant is "what is this file"? I mean, if the file is too short, what's wrong with just removing it and its transcription?

There are multiple files. Initially there were 3. Then I removed each file and its transcription.

In fact, one of the files had a good-length transcription too. Example:

    eng_text_90-2_e_man_m_output.wav, 33964, tenaliraman approached thimmana and appeased him with his expertise in spontaneous poetry

But the problem is that, after rerunning the code, I am getting the same error for more files.

If a single run had at least told me all the files that trigger this error, I could have removed them all at once. But that is where the problem is: I get the error for a few files, and after that there is no output. Then when I rerun, I get the same error for new files.

So I am just removing the files and their corresponding transcriptions and rerunning the code.

What would help here is to document what the transcription AND the audio length are. You might be able to search more broadly this way…
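For example, something like this in util/feeding.py would record both (a sketch; the variable names are those from the snippet quoted further down):

    # Sketch: report the transcript and both lengths instead of just the file
    # name, so a single failing run gives you enough to filter on.
    raise ValueError('Error: Audio file {} yields {} feature frames, fewer than the {} characters of "{}".'.format(
        wav_file, source_len, target_len, transcript))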

What is the minimum length of audio that I should feed while training the model?

Have a look at the source code that generates the error; you'll get the answer. The stack tells you it is at util/feeding.py:151.

I checked the condition in the code:

    # Feature frames extracted from the audio file...
    source = audiofile_to_input_vector(wav_file, self._model_feeder.numcep, self._model_feeder.numcontext)
    source_len = len(source)
    # ...and the per-character label sequence for the transcript.
    target = text_to_char_array(transcript, self._alphabet)
    target_len = len(target)
    # CTC requires at least as many input frames as output labels.
    if source_len < target_len:
        raise ValueError('Error: Audio file {} is too short for transcription.'.format(wav_file))

This tells me that the error is raised whenever the audio yields fewer feature frames than the transcript has characters. Since the features are computed at every 0.01 s time step (see the docstring quoted below), a transcript of N characters needs roughly at least N × 0.01 s of audio; the ~90-character example above would need at least about 0.9 s.

I tried to apply this condition to my audio files to filter out the short ones, but I am not able to recreate text_to_char_array, as it comes from another file. What are your suggestions at this point?

Read the source, Luke!

    $ git grep "def text_to_char_array"
    util/text.py:def text_to_char_array(original, alphabet):

Yeah, I checked that it comes from util/text.py, but since that code requires some 'config_file', I don't know how to recreate the text_to_char_array function independently for my purpose. Is there any other method to filter out the shorter audio files?

Sorry to insist, but read the source. Your config_file is the … alphabet file. So I guess it is something you have?
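A minimal sketch of calling it standalone (assuming util/text.py's Alphabet class takes the path to your alphabet file, and that you run this from the DeepSpeech checkout so util/ is importable):

    # Sketch: compute target_len outside of training.
    from util.text import Alphabet, text_to_char_array

    alphabet = Alphabet('data/alphabet.txt')  # path to your alphabet file (assumption)
    target = text_to_char_array('hello world', alphabet)
    print(len(target))  # this is target_len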

Okay, got it.

I want this function to work:

    # From DeepSpeech's util/audio.py; audioToInputVector is defined earlier
    # in the same file.
    import scipy.io.wavfile as wav

    def audiofile_to_input_vector(audio_filename, numcep, numcontext):
        r"""
        Given a WAV audio file at ``audio_filename``, calculates ``numcep`` MFCC features
        at every 0.01s time step with a window length of 0.025s. Appends ``numcontext``
        context frames to the left and right of each time step, and returns this data
        in a numpy array.
        """
        # Load wav files
        fs, audio = wav.read(audio_filename)

        return audioToInputVector(audio, fs, numcep, numcontext)

What do I feed in place of numcep and numcontext? How are they calculated, or where do they come from?

Can you read the source calling that? It's clearly trivial. Hint: git grep audiofile_to_input_vector
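For reference, a minimal standalone call would look like this (assuming the values come from DeepSpeech.py's --n_input and --n_context flags, whose defaults are 26 and 9 if I'm not misremembering):

    # Sketch: compute source_len for a single file with the assumed defaults
    # (26 cepstral coefficients, 9 context frames on each side).
    from util.audio import audiofile_to_input_vector

    source = audiofile_to_input_vector('g0907_e_tam_f_output.wav', 26, 9)
    print(len(source))  # this is source_len, the number of feature frames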

This got resolved. Thanks a lot.

I wrote a script to filter out all the files with source_len (audio) < target_len (transcript), then tested a training run, and it runs fine.
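In case someone else hits this, here is roughly what such a filter looks like (a sketch: the CSV and alphabet paths and the 26/9 values are assumptions; the columns are DeepSpeech's wav_filename, wav_filesize, transcript, and it must be run from the DeepSpeech checkout so util/ is importable):

    # Sketch: drop CSV rows whose audio yields fewer feature frames than the
    # transcript has characters, i.e. the exact condition feeding.py checks.
    import pandas

    from util.audio import audiofile_to_input_vector
    from util.text import Alphabet, text_to_char_array

    NUMCEP, NUMCONTEXT = 26, 9                # assumed n_input / n_context defaults
    alphabet = Alphabet('data/alphabet.txt')  # your alphabet file (assumption)
    df = pandas.read_csv('TEST.csv')          # your import CSV (assumption)

    def long_enough(row):
        source_len = len(audiofile_to_input_vector(row['wav_filename'], NUMCEP, NUMCONTEXT))
        target_len = len(text_to_char_array(row['transcript'], alphabet))
        return source_len >= target_len

    df[df.apply(long_enough, axis=1)].to_csv('TEST_filtered.csv', index=False)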

Now I need to use these files and run on a Linux platform with CUDA support.

I have tensorflow-gpu 1.4 and CUDA 8.0.

When I run the main training code, I get this:

    tensorflow.python.framework.errors_impl.NotFoundError: libcudart.so.9.0: cannot open shared object file: No such file or directory

Does this have something to do with my installation of TensorFlow or the CUDA binaries?

Your TensorFlow tries to use CUDA 9.0, not 8.0.
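You can check what the installed wheel actually links against with something like this (a sketch; the internal .so path is my assumption for TF 1.x wheels, and ldd is Linux-only):

    # Sketch: locate TensorFlow's native library without importing tensorflow
    # (the import itself is what fails when libcudart.so.9.0 is missing),
    # then ask the dynamic linker which libraries it wants.
    import os
    import subprocess
    import sysconfig

    site = sysconfig.get_paths()['purelib']
    lib = os.path.join(site, 'tensorflow', 'python', '_pywrap_tensorflow_internal.so')
    print(subprocess.check_output(['ldd', lib]).decode())  # look for libcudart.so.*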

So I uninstall CUDA 8 and install CUDA 9, right?

Well, you said TensorFlow GPU 1.4, which should be linked against CUDA 8.0, so I'm a bit doubtful about your setup. I cannot recommend anything.

Can I give you more information?

When I run nvcc --version, I get:

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2016 NVIDIA Corporation
    Built on Tue_Jan_10_13:22:03_CST_2017
    Cuda compilation tools, release 8.0, V8.0.61

And when I run pip list | grep tensorflow, I get:

    tensorflow-gpu                     1.4.0
    tensorflow-tensorboard             0.4.0

Will getting TensorFlow 1.6 and CUDA 9.0 help?

Also, if it's tensorflow-gpu 1.4, and hence linked against CUDA 8.0, why is it trying to use CUDA 9.0?

How can I know? It's your setup, not mine :-(.

Okay, but can you recommend which TensorFlow/CUDA combination will work?