Hi, my research is about voice recognition. I would like to contribute to and modify DeepSpeech so that it can recognize a speaker's voiceprint alongside speech-to-text, using speaker diarization, for Taiwanese Mandarin (zh-TW). I have also started studying deep learning and machine learning in more depth. My questions are:
Is it really possible to modify the DeepSpeech neural network by hand, changing its layers so that it can perform an additional task such as voiceprint recognition? If it is possible, please give me some pointers on how to modify the DeepSpeech network.
In my case, I want to recognize both the speech and the voiceprint, so DeepSpeech would have two outputs. Are the right keywords for this approach "multi-task learning" and "speaker diarization"?
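Yes, a shared network with two outputs is the usual multi-task learning setup. Below is a minimal NumPy sketch of the idea only, not DeepSpeech's actual code: a shared encoder (one dense layer standing in for DeepSpeech's RNN stack) feeds two heads, one producing per-frame character logits for the transcript and one producing an utterance-level speaker embedding for the voiceprint. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_frames, n_features = 50, 26   # e.g. 50 frames of 26 MFCC features
hidden = 128                    # shared-encoder width (illustrative)
n_chars = 4000                  # rough size of a zh-TW character alphabet
embed_dim = 64                  # speaker-embedding size

# Shared encoder weights (stands in for DeepSpeech's recurrent layers)
W_shared = rng.standard_normal((n_features, hidden)) * 0.01
# Head 1: per-frame character logits (fed to a CTC loss in real training)
W_ctc = rng.standard_normal((hidden, n_chars)) * 0.01
# Head 2: utterance-level speaker embedding (fed to a speaker-ID loss)
W_spk = rng.standard_normal((hidden, embed_dim)) * 0.01

x = rng.standard_normal((n_frames, n_features))   # one utterance of features
h = np.maximum(x @ W_shared, 0.0)                 # shared representation (ReLU)

char_logits = h @ W_ctc                           # shape (n_frames, n_chars)
speaker_embedding = np.mean(h, axis=0) @ W_spk    # shape (embed_dim,)

print(char_logits.shape, speaker_embedding.shape)
```

In real multi-task training you would optimize a weighted sum of the two losses (CTC for the transcript, a classification or metric loss for the speaker), so both heads shape the shared encoder.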
Currently I am trying to train DeepSpeech on the Taiwanese (zh-TW) dataset from Common Voice, but that dataset currently only has about 73 hours in total, which I am sure is not enough to reach high accuracy. Still, I tried training on just this data first.
This is my code for training:
python DeepSpeech.py --train_files ./data/CV/zh-TW/clips/train.csv --dev_files ./data/CV/zh-TW/clips/dev.csv --test_files ./data/CV/zh-TW/clips/test.csv --epochs 20 --export_dir ./model_result --use_allow_growth true
and I got loss something like this:
Epoch 19 | Training   | Elapsed Time: 4:28:26 | Steps: 17692 | Loss: 60.065648
Epoch 20 | Validation | Elapsed Time: 0:10:06 | Steps: 2627  | Loss: 62.984308 | Dataset: ./data/CV/zh-TW/clips/dev.csv
Is this a normal loss for only 73 hours of data, or is it too high? I am not really sure whether I trained correctly or not.
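One way to judge training quality beyond the raw CTC loss is to decode the test set and measure character error rate (CER), the usual metric for Chinese since there are no word boundaries. A small self-contained sketch (the function names here are my own, not a DeepSpeech API):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("你好嗎", "你好嗎"))  # → 0.0 (perfect transcript)
print(cer("你好嗎", "你嗎"))    # → 0.333... (one character deleted)
```

If the CER on held-out data keeps dropping across epochs, training is working, even if the absolute loss value looks large.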
Because the Taiwanese dataset is only 73 hours, I plan to do transfer learning from another Chinese dataset that already has hundreds of hours, or could I use transfer learning from an English model? If anyone has done transfer learning with DeepSpeech before, please kindly share your experience.
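For reference, DeepSpeech's training documentation (around v0.9) describes a transfer-learning flow along these lines: load a released checkpoint, drop the output layer(s) because the target alphabet differs, and fine-tune on the new data. The sketch below is based on that documented flow; the checkpoint directory and alphabet paths are placeholders, and the exact flag names should be checked against the TRAINING docs of your checkout:

```shell
# Hedged sketch of transfer learning from the English checkpoint to zh-TW;
# verify flag names against the DeepSpeech training docs for your version.
python DeepSpeech.py \
  --train_files ./data/CV/zh-TW/clips/train.csv \
  --dev_files ./data/CV/zh-TW/clips/dev.csv \
  --test_files ./data/CV/zh-TW/clips/test.csv \
  --alphabet_config_path ./data/CV/zh-TW/alphabet.txt \
  --load_checkpoint_dir ./deepspeech-english-checkpoint \
  --save_checkpoint_dir ./checkpoints/zh-TW \
  --drop_source_layers 1 \
  --epochs 20 \
  --export_dir ./model_result \
  --use_allow_growth true
```

Dropping the final layer is necessary when transferring across alphabets (English characters vs. zh-TW characters), since that layer's size depends on the alphabet.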
Thank you very much for your kind help. I am really willing to learn more about this.