Classify speech into predetermined sentences

I am trying to build a model that will classify spoken Spanish sentences into a set of around 2000 possible answer sentences.

So far, I have tried converting the audio into MFCCs and training a CNN on them. The model was accurate on the training data but very inaccurate on unseen data. The training set consisted of 19 speakers and 38,000 examples.
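For anyone unfamiliar with the preprocessing step mentioned above, here is a minimal, self-contained sketch of MFCC extraction in plain numpy. A real pipeline would use a library such as librosa or python_speech_features; the frame sizes and filterbank parameters below are illustrative, not the ones used in the post.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # 1) Frame the signal and apply a Hann window
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hanning(n_fft))
    frames = np.array(frames)
    # 2) Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Triangular mel filterbank
    def hz_to_mel(hz): return 2595 * np.log10(1 + hz / 700)
    def mel_to_hz(mel): return 700 * (10 ** (mel / 2595) - 1)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 4) DCT-II to decorrelate; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ basis.T

# One second of a 440 Hz tone as a stand-in for real speech audio
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feat = mfcc(sig)
print(feat.shape)  # one row of 13 coefficients per frame
```

Stacking these per-frame coefficient rows into a 2-D array is what produces the image-like input the CNN is trained on.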

Now, I want to follow the example in this post: TUTORIAL : How I trained a specific french model to control my robot, but I am unsure whether it will be accurate on unseen voices. Do you think that will be an issue if I train it on 19 different voices saying each sentence? Also, could I improve the accuracy using the Spanish Common Voice dataset?

If you were trying to build a model to classify spoken Spanish sentences into a set of 2000 possible answer sentences, what would be your approach?


What exactly is your goal here? Producing some command-driven tooling?

Sounds like you were overfitting. Can you post a bit more about the CNN? Just out of curiosity.

Usually you try to get speakers to say different sentences, but if that is the data you already have, train with it. 38,000 sentences is not much though; combine it with the Spanish Common Voice dataset and whatever else you can get your hands on. But even with Common Voice it will be hard.

Lastly, read up on custom language models, which you could use to improve recognition results. But as @lissyx says, that depends on the use case. What are you working on?


The use case is that users will say one of 2,000 Spanish sentences and the model should predict which sentence they said. It is for a language-learning application.

Something that might improve my chances of success is that I do not need the model to predict the actual sentence as the single most likely answer. If it can place the actual sentence in the top 5% of results, that would be acceptable.

So, let’s say the correct sentence is “Mi nombre es Daniel” and the user actually says it, but the algorithm predicts some other sentence as more likely. This would be OK as long as “Mi nombre es Daniel” is somewhere in the top 5% of most likely results.
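That “top 5% of results” criterion is easy to evaluate directly from the softmax output. A small sketch (the function name and the random example probabilities are mine, not from the post):

```python
import numpy as np

def in_top_fraction(probs, true_idx, fraction=0.05):
    # True if the correct class ranks within the top `fraction` of all classes;
    # with 2000 classes and fraction=0.05 that is the top 100 predictions.
    k = max(1, int(len(probs) * fraction))
    top_k = np.argsort(probs)[::-1][:k]   # indices of the k highest probabilities
    return true_idx in top_k

# A random softmax-like distribution over 2000 classes
probs = np.random.dirichlet(np.ones(2000))
print(in_top_fraction(probs, int(np.argmax(probs))))  # the argmax is always in the top 5%
```

Keras can also report this during training via `tf.keras.metrics.TopKCategoricalAccuracy(k=100)` alongside plain accuracy.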

This is the model:

from tensorflow.keras.layers import Input, Conv2D, Flatten, Dropout, Dense
from tensorflow.keras.models import Model

i = Input(shape=train_X_ex[0].shape)
x = Conv2D(32, (3, 3), strides=2, activation='relu')(i)
x = Conv2D(64, (3, 3), strides=2, activation='relu')(x)
x = Conv2D(128, (3, 3), strides=2, activation='relu')(x)
x = Flatten()(x)
x = Dropout(0.2)(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(K, activation='softmax')(x)

modelMCSSEarlyStopping = Model(i, x)
modelMCSSEarlyStopping.compile(loss='categorical_crossentropy',
                               optimizer='adam', metrics=['accuracy'])

r =, train_y,
                              validation_data=(test_X_ex, test_y),
                              epochs=epochs)

In training, I found that the validation loss would only go down for the first two epochs and would rise from there. I couldn’t load all the data into memory, so I passed it in in batches of about 3,800 examples.
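Since the validation loss bottoms out after about two epochs, early stopping with best-weight restoration would at least stop the model from drifting further into overfitting. Keras has this built in (`tf.keras.callbacks.EarlyStopping(patience=..., restore_best_weights=True)`); the framework-agnostic logic is just:

```python
# Framework-agnostic sketch of early stopping: stop once the validation loss
# has not improved for `patience` consecutive epochs, and remember the best
# epoch (in a real trainer you would save/restore the weights at that point).
def train_with_early_stopping(epoch_val_losses, patience=3):
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, val_loss in enumerate(epoch_val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch, wait = val_loss, epoch, 0  # save weights here
        else:
            wait += 1
            if wait >= patience:
                break                                         # restore best weights here
    return best_epoch, best_loss

# Validation losses like those described: down for two epochs, then rising
print(train_with_early_stopping([2.1, 1.7, 1.9, 2.3, 2.8, 3.0]))  # (1, 1.7)
```

Early stopping only limits the damage, though; it does not address why the network fails to generalise to unseen speakers in the first place.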

Also, I augmented each training example six ways: increasing the pitch, decreasing the pitch, adding white noise, lowering the volume, speeding it up, and removing the blank audio before the person starts speaking. This resulted in 38,000 × 6 total examples. I think this may have something to do with the overfitting.
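For reference, several of the augmentations described can be sketched in plain numpy; pitch shifting normally needs a DSP library such as librosa, so it is left out, and the thresholds and factors here are arbitrary examples rather than the values used in the post.

```python
import numpy as np

def augment(signal, seed=0):
    """Illustrative versions of some of the augmentations described above."""
    rng = np.random.default_rng(seed)
    out = {}
    # Add low-amplitude white noise
    out["noise"] = signal + 0.005 * rng.standard_normal(len(signal))
    # Lower the volume
    out["quiet"] = signal * 0.5
    # Speed up ~10% by resampling via linear interpolation
    idx = np.arange(0, len(signal), 1.1)
    out["fast"] = np.interp(idx, np.arange(len(signal)), signal)
    # Trim leading silence below an amplitude threshold
    nonsilent = np.flatnonzero(np.abs(signal) > 0.01)
    out["trimmed"] = signal[nonsilent[0]:] if nonsilent.size else signal
    return out

# A short tone preceded by silence, standing in for a real recording
sig = np.concatenate([np.zeros(1000),
                      np.sin(2 * np.pi * 220 * np.arange(8000) / 16000)])
aug = augment(sig)
print(len(aug["fast"]) < len(sig), len(aug["trimmed"]) < len(sig))  # True True
```

Note that all six variants still come from the same 19 voices, so this kind of augmentation multiplies the data without adding any speaker variety.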

I also tried models on spectrograms and mel spectrograms, but they did not seem to work as well.

Do you have any suggestions for how I might do transfer learning from a similar model, or any other ideas? Thanks.

Install DeepSpeech along with the English model and scorer. Then search this forum for custom language models/scorers and experiment a bit with English before transferring what you learn to Spanish. I don’t know whether there is a free Spanish model + scorer, but you could build one yourself. As for using the same input six times: that kind of augmentation sounds more like an image/CNN approach than speech …


You should try searching; other people have shared their work on Spanish models here.
Also, you should be able to base your work on the project I just finished updating to 0.7.
