I would like to develop a dedicated STT system. More precisely, I want
to make a system that recognizes (and also generates) a “small” number
of sentences (~5,000) but perfectly. Would it be possible with DeepSpeech ?
I was also wondering whether such systems would require a GPU server tu
work or if they could run on small platforms such as Paspberry Pi 3 B.
First, about the RPI3 usage, inferences with large model doesn’t seems to be possible, at least for now because inference process takes too long time, to hope realtime.
Lissyx is working on AOT model (optimized), so perhaps it could be possible, reducing the model !
You’d like to use a model, with a 5000 possible sentences, with the best accuracy possible :
Sure, with Deepspeech, you’re in the right place.
Hoping that your sentences contain standard words,
you create a vocab.txt, containing all your ~5000 sentences, a LM and a TRIE files.
You should obtain a very good accuracy, quite better than everything known in phonemes STT
But, I must say that, actually, there is no perfect solution ! Not yet…
hi @jtane
Well, quality model is dependant of learnt material volume.
Sure it must be good quality material.
the more material (wav sentences/corresponding transcritions), the better the model inferences !
Of course, the more material == the bigger the model !