I'm interested in knowing how many people here have successfully fine-tuned the DeepSpeech models for conversational American English. If you have, how many additional hours of conversational data did you use, and what WER did you achieve?
Or has anyone trained from scratch? If so, with how many hours, and what WER did you get? Were any model changes made in that case?
I have been working with DeepSpeech 0.7.0 (Ubuntu 16.04, Python 3.6.5, TensorFlow 1.15.2, CUDA 10.0/cuDNN 7.6.5, training on an RTX 2080 Ti). Running inference with the released DeepSpeech model on my (conversational) test data gives an average WER of ~40%. Changing the language model (with alpha and beta tuned on my custom text) gives about a 1% improvement. I have fine-tuned with about 2000 hours of a part-conversational, part-speech dataset (lr 0.00001, dropout_rate 0.40, early stopping at around 35 epochs with es_epochs set to 20) and have seen marginal to no improvement on my test data. I am still working on improving the LM and am also testing the effect of using no LM.
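For anyone comparing numbers in this thread, it may help to pin down how an "average WER" is computed, since the mean of per-utterance WERs and the pooled corpus-level WER can differ noticeably on conversational data with many short utterances. Below is a minimal sketch in plain Python, not DeepSpeech's own evaluation code; the function names `word_errors`, `corpus_wer`, and `mean_utterance_wer` are made up for illustration, and the sample transcripts are placeholders.

```python
from typing import Iterable, List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Pooled WER: total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words if words else 0.0


def mean_utterance_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Unweighted mean of per-utterance WERs (skips empty references)."""
    rates: List[float] = []
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        if n:
            rates.append(e / n)
    return sum(rates) / len(rates) if rates else 0.0


if __name__ == "__main__":
    # Placeholder (reference, hypothesis) pairs, not real test data.
    sample = [("the cat sat", "the cat sat"), ("hello world", "hello")]
    print("corpus WER:", corpus_wer(sample))
    print("mean utterance WER:", mean_utterance_wer(sample))
```

Reporting which of the two figures is meant, and whether the transcripts were normalized (case, punctuation, filler words) before scoring, makes results from different people easier to compare.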
I searched here and did not find any recent posts geared specifically towards conversational datasets. I'm interested in the experience other people have had and what has or has not worked for them. Since the released base models have been trained on both Fisher and Switchboard (I'm guessing they make up at least 1/3 of the total training data), I was expecting slightly better results on conversational data.
Appreciate any input here.