The use case is that users will say one of 2,000 Spanish sentences and the model should predict which sentence they said. It is for a language-learning application.
Something that might improve my chances of success is that the model does not need to rank the actual sentence as the single most likely answer. If it ranks the actual sentence in the top 5% of results, that would be acceptable.
So, say the correct sentence is “Mi nombre es Daniel” and the user says “mi nombre es Daniel”, but the algorithm ranks some other sentence as more likely. That would be fine, as long as “Mi nombre es Daniel” is somewhere in the top 5% of the most likely results.
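That acceptance criterion is just top-k accuracy with k = 100 (5% of 2,000 classes). A minimal NumPy sketch of the metric (variable names and the toy data are hypothetical):

```python
import numpy as np

def top_k_hit_rate(probs, labels, k):
    """Fraction of examples whose true label falls among the model's
    k most probable classes (here k = 100, i.e. the top 5% of 2000)."""
    # indices of the k largest probabilities in each row
    topk = np.argsort(probs, axis=1)[:, -k:]
    hits = [label in row for row, label in zip(topk, labels)]
    return float(np.mean(hits))

# toy check: 3 examples over 2000 classes, true label = each row's argmax
rng = np.random.default_rng(0)
probs = rng.random((3, 2000))
probs /= probs.sum(axis=1, keepdims=True)
labels = np.argmax(probs, axis=1)
print(top_k_hit_rate(probs, labels, k=100))  # 1.0: the argmax is always in the top k
```

If you are training in tf.keras, the same quantity can be tracked during training with `tf.keras.metrics.TopKCategoricalAccuracy(k=100)` (or the `SparseTopKCategoricalAccuracy` variant for integer labels).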
This is the model:
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dropout, Dense
from tensorflow.keras.models import Model

# Input expects the shape of a single example, without the batch axis
i = Input(shape=train_X_ex.shape[1:])
x = Conv2D(32, (3, 3), strides=2, activation='relu')(i)
x = Conv2D(64, (3, 3), strides=2, activation='relu')(x)
x = Conv2D(128, (3, 3), strides=2, activation='relu')(x)
x = Flatten()(x)
x = Dropout(0.2)(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(K, activation='softmax')(x)  # K = 2000 sentence classes
model = Model(i, x)
# compile before fitting (loss assumes integer labels; use
# 'categorical_crossentropy' if train_y is one-hot)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
r = model.fit(train_X_ex, train_y, validation_data=(test_X_ex, test_y), epochs=epochs)
In training, I found that the validation loss only went down for the first two epochs and rose from there. I couldn’t load all the data into memory, so I passed it in as batches of about 3,800 examples.
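Since the validation loss bottoms out after about two epochs, an `EarlyStopping` callback (standard in tf.keras) would at least stop training there and keep the best weights. A minimal sketch (the `patience` value is illustrative):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once val_loss stops improving, and roll back to the best
# epoch's weights instead of keeping the overfit final ones.
early_stop = EarlyStopping(monitor='val_loss',
                           patience=3,  # epochs to wait for improvement
                           restore_best_weights=True)

# Then pass it to fit, e.g.:
# r = model.fit(train_X_ex, train_y,
#               validation_data=(test_X_ex, test_y),
#               epochs=epochs,
#               callbacks=[early_stop])
```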
Also, I augmented each training example six ways: raising the pitch, lowering the pitch, adding white noise, lowering the volume, speeding it up, and removing the blank audio before the person starts speaking. This resulted in 38,000 × 6 total examples. I think this may have something to do with the overfitting.
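For reference, the noise, volume, and silence-trimming augmentations can be written in plain NumPy; pitch shifting and time stretching usually go through a library such as librosa (`librosa.effects.pitch_shift`, `librosa.effects.time_stretch`). A sketch, with parameter values chosen purely for illustration:

```python
import numpy as np

def add_white_noise(wave, snr_db=20.0):
    """Mix in Gaussian noise at a given signal-to-noise ratio (in dB)."""
    rms = np.sqrt(np.mean(wave ** 2))
    noise_rms = rms / (10 ** (snr_db / 20))
    noise = np.random.default_rng(0).normal(0.0, noise_rms, wave.shape)
    return wave + noise

def change_volume(wave, gain=0.5):
    """Scale the amplitude (gain < 1 lowers the volume)."""
    return wave * gain

def trim_leading_silence(wave, threshold=0.01):
    """Drop samples before the amplitude first exceeds the threshold."""
    above = np.flatnonzero(np.abs(wave) > threshold)
    return wave if above.size == 0 else wave[above[0]:]
```

One thing worth noting: generating a fixed 6× copy of the data up front means the network sees the exact same augmented waveforms every epoch, which it can memorize; applying the augmentations on the fly, with randomized parameters per epoch, tends to help more against overfitting.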
I also tried models on spectrograms and mel spectrograms, but they did not seem to work as well.
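For anyone unfamiliar with the spectrogram inputs mentioned above: a magnitude spectrogram is just a windowed short-time FFT of the waveform, sketched here in plain NumPy (for the mel version one would typically use `librosa.feature.melspectrogram`; frame and hop sizes below are illustrative):

```python
import numpy as np

def stft_magnitude(wave, frame_len=512, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time FFT.

    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```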
Do you have any suggestions for how I might do transfer learning from a similar model, or any other suggestions? Thanks.