Wakeword example using synthetic data

Some back history: my own experience with various open-source wakewords is that they are not particularly accurate and are often prone to false positives as well as false negatives.

I have made an improvement on the above, as nearly all of them seem to provide little more than a binary-style classification of Wakeword, Unknown words and Noise.
This has several problems: with so few classes the cross-entropy loss gives the model little to discriminate between, so it overfits on training, and in terms of features there is a huge class imbalance.

I have fixed this to a certain extent by adding further classes, and the example above is ‘Computer’ via a CRNN from the Google Research KWS streaming repo.

What I do is quite simple: first I create a language database (‘English’) with syllable and phoneme tables and counts.

‘Unknown’ then becomes all words with the same syllable count as the wakeword, as an approximation of the key spectra in the MFCCs.
For a single-word wakeword there are two further classes, ‘LikeKW1’ and ‘LikeKW2’, which use a phoneme selection to create ‘sounds like’ words on the first and last syllables.
These words are excluded from Unknown because of the way softmax works, but they make the training work harder to find distinguishing features.
Then there is a ‘1syl’ class of one-syllable words to try to force overall edge/texture detection, and one I call ‘Phon’, which is words concatenated and trimmed to the complete KW duration in a similar manner to noise, again forcing overall edge/texture detection.
So I end up with KW, LikeKW1, LikeKW2, 1syl, Phon, NotKW (unknown) and Noise as classifications.
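
To make that concrete, here is a minimal sketch of how the class word lists could be pulled from such a database. The schema is my own invention for illustration (a `words` table with `word`, `syllables`, `first_phones` and `last_phones` columns), not the actual one:

```python
import sqlite3

KW = "computer"
KW_SYLLABLES = 3  # syllable count of the wakeword

# Hypothetical schema: words(word TEXT, syllables INT, first_phones TEXT, last_phones TEXT)
con = sqlite3.connect("english.db")
cur = con.cursor()

kw_first, kw_last = cur.execute(
    "SELECT first_phones, last_phones FROM words WHERE word = ?", (KW,)
).fetchone()

# LikeKW1 / LikeKW2: 'sounds like' words sharing the wakeword's
# first- or last-syllable phones.
like_kw1 = [w for (w,) in cur.execute(
    "SELECT word FROM words WHERE first_phones = ? AND word != ?", (kw_first, KW))]
like_kw2 = [w for (w,) in cur.execute(
    "SELECT word FROM words WHERE last_phones = ? AND word != ?", (kw_last, KW))]

# NotKW (unknown): same syllable count as the wakeword, with the sounds-like
# sets excluded so softmax is not asked to split near-identical spectra.
excluded = set(like_kw1) | set(like_kw2) | {KW}
not_kw = [w for (w,) in cur.execute(
    "SELECT word FROM words WHERE syllables = ?", (KW_SYLLABLES,))
    if w not in excluded]

# 1syl: single-syllable words to push training toward edge/texture features.
one_syl = [w for (w,) in cur.execute("SELECT word FROM words WHERE syllables = 1")]
```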

Now I use TTS, with a toy dataset of voices from the following:
Coqui ⓍTTSv2: 870
EmotiVoice: 1932
Piper: 904 (sherpa-onnx)
Kokoro v1: 53 (sherpa-onnx)
Kokoro v1.1: 103 (sherpa-onnx)
Kokoro_en: 11 (sherpa-onnx)
VCTK: 109 (sherpa-onnx)

I say ‘toy’ as it is just shy of 4000 wakeword samples; the 14000 samples of NotKW set my class sample size, and every class is augmented up to that quantity.
TTS is great because clean samples are key: they let you be accurate when augmenting with noise and reverberation (even if I don’t bother with reverb).
From the phonetic columns I run various SQLite GROUP BY clauses to try to create an even distribution of phones and augmentation levels, and to ensure each wakeword voice exists across the classifications, so that the edges of the spectra become more important than any textures that different recordings and different datasets can introduce.
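
As a flavour of those GROUP BY clauses, a minimal sketch against the same hypothetical schema as above (the cap of 20 words per phone group is an arbitrary number for illustration):

```python
import sqlite3

con = sqlite3.connect("english.db")
cur = con.cursor()

# Inspect how evenly the leading phones are represented among
# candidates with the wakeword's syllable count.
for phones, n in cur.execute(
        "SELECT first_phones, COUNT(*) FROM words "
        "WHERE syllables = 3 GROUP BY first_phones ORDER BY COUNT(*) DESC"):
    print(phones, n)

# Cap every phone group at the same size so no single phone dominates
# the class (ROW_NUMBER window functions need SQLite >= 3.25).
balanced = [w for (w,) in cur.execute(
    "SELECT word FROM ("
    "  SELECT word, ROW_NUMBER() OVER"
    "    (PARTITION BY first_phones ORDER BY RANDOM()) AS rn"
    "  FROM words WHERE syllables = 3"
    ") WHERE rn <= 20")]
```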

I have added the resultant TFLite streaming / non-streaming versions and a quantised version, along with the training logs. I started with the basic KW, NotKW, Noise, then added the two LikeKW classes, then 1syl and finally Phon, all in separate training runs, with the logs included so you can see the training curves this provides.
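
For anyone wanting to reproduce the quantised conversion, it is standard TensorFlow post-training quantisation. A minimal sketch, noting that the kws_streaming repo ships its own conversion utilities, and that the saved-model path, input shape and calibration generator here are placeholders of mine:

```python
import numpy as np
import tensorflow as tf

def representative_samples():
    # In practice, yield a few hundred real feature batches from the
    # training set; random data only stands in for the shape here.
    for _ in range(100):
        yield [np.random.rand(1, 49, 40).astype(np.float32)]  # assumed input shape

converter = tf.lite.TFLiteConverter.from_saved_model("non_stream_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]       # enable quantisation
converter.representative_dataset = representative_samples  # calibration data
tflite_model = converter.convert()

with open("crnn_quant.tflite", "wb") as f:
    f.write(tflite_model)
```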

Keyword models work better with more classifications, and your dataset design should try to force an even distribution of features. With more classes there is less chance of softmax triggering the KW not because it has a strong feature hit, but simply because the hits are extremely low in all the other classes.
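
A toy numpy illustration of that softmax point (the logits are invented numbers, not model output): the same mediocre KW logit that wins in a 3-class head falls well below a sensible trigger threshold once the sounds-like classes can soak up probability mass.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# An ambiguous input: the KW logit is mediocre rather than strong.
# 3-class head (KW, NotKW, Noise): KW still 'wins' at ~0.50.
print(softmax(np.array([1.0, 0.4, 0.2])))

# 7-class head (KW, LikeKW1, LikeKW2, 1syl, Phon, NotKW, Noise):
# the sounds-like classes absorb the ambiguity and KW drops to ~0.21.
print(softmax(np.array([1.0, 0.9, 0.8, 0.4, 0.5, 0.4, 0.2])))
```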

It works well, but I stopped at just shy of 4000 voices, which, against all the prosody spoken English can have, is obviously hugely overfitted. The problem is that modern TTS, like much of open-source speech tech, has the language covered but falls completely short when it comes to the dialects, accents and varied prosody we meet in the wild.
It actually works pretty damn well, but even using cloning TTS, trying to find a source of dialect / accent datasets with good metadata and a good prosody range seems a huge struggle.
Otherwise I would have continued and maybe moved on to providing transfer learning so that I could create other wakewords with smaller datasets.

I did start off on this quest, but the lack of usable clean recordings that are not incorrectly labelled (due to problems with forced alignment) has stopped me using the likes of CommonVoice and ML-Commons, which also don’t have the metadata to filter on and create balanced datasets.

I am pleasantly surprised by the results, as I was expecting less from what is purely an example toy dataset: the model is far more resistant to false positives, accurate on the KW, and works out to at least 3 metres without a microphone array or noise filter.
Classification models just work better with more classes to spread and balance features, and likely further classes can be added as KWs with similar syllable counts but unique phonetics. That also gives a choice of wakeword in use with no increase in model size and, from what I can see, no observable compute increase, at least so far.

I have only a basic understanding of ML and of classification wakeword models, where, with a simple model, results are very dependent on the dataset.
I am wondering, for those of you more knowledgeable out there: are there better methods?
Also, has anyone got any hints on how to get more voices that are not just ‘neutral’ English: more representation of the many dialects and the nations that use English as a second language (and, as some of you might think of those of us in England, the same applies) :slight_smile:
I am presuming it is similar elsewhere, where Beijingese is what I would call the common ‘neutral’ dialect, prominent to such an extent that other dialects get very little attention?
Does anyone know of a large dialect prosody dataset of phonetic pangram sentences to use with one of the latest cloning TTS systems?

Phonetic pangrams are simple short sentences containing all the phonemes of a language (e.g. ‘The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted’), and they likely have more value than random sentences, requests for which could go on ad infinitum.
If there were a dataset of short sentences across many dialects and accents, phonetic in its criteria, it would be massively helpful in using cloning TTS to provide essential clean data for all types of speech datasets.

It needs linguists, as the prosody / intonation of multi-syllable words is a little more complex than just base phonetics, but it could likely produce far more concise, smaller datasets rich in value, where many voices saying the same thing is of much use.