Problem with Voice Corpus Tool augment when converting stereo to mono

When I used the Voice Corpus Tool's augment command to inject noise into a clean audio file, it produced a stereo audio file, but we can't train on stereo audio, right? So I converted the stereo file to mono, but the noise was removed from the audio during the conversion. What should I do? Any suggestions?
I used ffmpeg for the stereo-to-mono conversion and also to change the sampling rate.
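I can't reproduce your exact files, but one common reason injected noise disappears in a stereo-to-mono downmix is phase: if the noise ended up anti-phase between the left and right channels, averaging the channels (which is roughly what a default downmix does) cancels it while keeping the correlated clean signal. A minimal NumPy sketch of that effect (the signals here are synthetic stand-ins, not your data):

```python
import numpy as np

# Hypothetical illustration: a clean tone correlated across both channels,
# plus noise that is anti-phase between left and right.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)    # identical in both channels
noise = 0.1 * rng.standard_normal(t.shape)

left = clean + noise
right = clean - noise                         # noise inverted in the right channel

# Averaging the channels (a typical stereo-to-mono downmix) cancels the
# anti-phase noise but keeps the correlated clean signal.
mono = (left + right) / 2.0

print(np.allclose(mono, clean))  # True: the noise is gone from the downmix
```

If that is what's happening, mixing the noise into a mono file directly (rather than downmixing afterwards) avoids the cancellation.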


Do you have a link to the augmentation tool you’re referring to?

It was the Mozilla Voice Corpus Tool. I figured it out: instead of the Voice Corpus Tool, I used sox merge to inject noise into the data, and it worked perfectly. I got both the noise and the clean audio in a single mono file.
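For anyone following along, the mixing step can be sketched in NumPy: conceptually, mixing two mono files (as `sox -m` does, modulo sox's default per-input volume scaling) just sums them sample by sample, so the result stays mono. The arrays below are hypothetical stand-ins for decoded audio samples:

```python
import numpy as np

# Stand-ins for clean speech and noise samples from two mono files.
clean = np.array([0.5, -0.25, 0.1, 0.0], dtype=np.float32)
noise = np.array([0.05, 0.05, -0.05, 0.05], dtype=np.float32)

# Sum sample by sample and guard against clipping; one channel in, one out.
mixed = np.clip(clean + noise, -1.0, 1.0)
print(mixed.shape == clean.shape)  # True: still a single mono channel
```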

I have one doubt regarding the language model. I generated a language model from my small dataset. When I ran inference with my language model, it worked somewhat, but the LibriSpeech language model worked better. My language model should do well, though, because it is constrained to my data: it has fewer words, so it should be easier to pick the right one, and I didn't use any out-of-training words. Can you explain why the LibriSpeech language model performs better even though it has never seen my sentences? If the LM is n-gram based, my LM should work well here.
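One possible factor (a guess, since the exact cause depends on how the LM was built): with a very small corpus, many word pairs are never observed, and without good smoothing those unseen n-grams score terribly during beam search even when the individual words are in vocabulary. A toy count-based bigram model (not KenLM, just the mechanics) shows this:

```python
from collections import Counter

# Toy maximum-likelihood bigram LM over a tiny hypothetical corpus.
# With so little data, any bigram not literally seen gets probability 0,
# which is exactly what proper smoothing/backoff in a real LM avoids.
corpus = ["the rating is good", "the rating is bad"]
bigrams = Counter(b for line in corpus
                    for b in zip(line.split(), line.split()[1:]))
unigrams = Counter(w for line in corpus for w in line.split())

def bigram_prob(w1, w2):
    # count(w1 w2) / count(w1); no smoothing at all.
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "rating"))  # 1.0 -> seen bigram
print(bigram_prob("rating", "the"))  # 0.0 -> unseen bigram, harshly penalized
```

A large LM like LibriSpeech's has seen enough text that almost any fluent word sequence gets a reasonable (smoothed) score, which can make it more robust than a tiny in-domain LM.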


Hi, when I used my own language model together with my own trie, the results were not good. But when I swapped in the LibriSpeech trie while keeping my own language model, it worked better than the LibriSpeech lm.binary; now my constrained language model gives better results than the LibriSpeech LM. I'm not saying LibriSpeech doesn't work; it also works fine. For example, my dataset has the word 'rating' but not 'at', so when 'at' is spoken the LibriSpeech LM correctly predicts 'at' (it has that word), while my language model predicts 'rating' (since 'at' is not in my data), which is the expected behavior. My doubt is: why does my own trie not work, while the LibriSpeech trie works perfectly with my own LM?

Sounds like you need to go over all the steps you did with the language model and trie. I had previously posted an example for limited vocabulary here, so it might be worth trying that as an initial test (others have made it work for them, so you'd know you were on track and not suffering from some simple typo or misunderstanding). Then you could try swapping the example's vocab for your own.

Those steps should still be current, but if you're using particular versions (or coming back to this much later), bear in mind that the steps may have changed by then, so that might also need care.


Thanks. Can you tell me the purpose of the code below in DeepSpeech, and does Mozilla DeepSpeech use a BiLSTM or an LSTM?
def create_overlapping_windows(batch_x):
    batch_size = tf.shape(batch_x)[0]
    window_width = 2 * Config.n_context + 1
    num_channels = Config.n_input

    # Create a constant convolution filter using an identity matrix, so that the
    # convolution returns patches of the input tensor as is, and we can create
    # overlapping windows over the MFCCs.
    eye_filter = tf.constant(np.eye(window_width * num_channels)
                               .reshape(window_width, num_channels, window_width * num_channels), tf.float32)

    # Create overlapping windows
    batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')

    # Remove dummy depth dimension and reshape into [batch_size, n_windows, window_width, n_input]
    batch_x = tf.reshape(batch_x, [batch_size, -1, window_width, num_channels])

    return batch_x

As per the comments here, it's not bidirectional; it's just a unidirectional LSTM.
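As for what the code does: the identity-filter convolution replaces each time step with the window of surrounding MFCC frames, so the model sees each frame with its context. A plain NumPy sketch of the same idea (assuming `n_context = 2` and a single feature per frame for readability; the real code uses `Config.n_context` and `Config.n_input`):

```python
import numpy as np

# Each time step t becomes the window of frames [t - n_context, t + n_context],
# with zero padding at the edges (the 'SAME' padding in the conv1d).
n_context = 2
window_width = 2 * n_context + 1

frames = np.arange(1, 7, dtype=np.float32)        # 6 time steps, 1 feature each
padded = np.pad(frames, (n_context, n_context))   # zero-pad both edges

windows = np.stack([padded[t:t + window_width] for t in range(len(frames))])
print(windows.shape)  # (6, 5): one window of 5 frames per time step
print(windows[0])     # [0. 0. 1. 2. 3.] -> zeros where context runs off the edge
```

The conv1d trick in the DeepSpeech code achieves this with a single tensor op instead of a Python loop, which is why the filter is built from an identity matrix: the convolution copies input patches through unchanged.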