Developing ad-hoc STT and TTS systems

Dear all,

I would like to develop a dedicated STT system. More precisely, I want
to make a system that recognizes (and also generates) a “small” number
of sentences (~5,000) but perfectly. Would it be possible with DeepSpeech?

I was also wondering whether such systems would require a GPU server to
work, or if they could run on small platforms such as a Raspberry Pi 3 B.

Thank you for your help!

Vincent

Hi, Vincent,

First, about usage on the RPi 3 B: inference with the large model doesn't seem to be possible, at least for now, because the inference process takes too long to hope for real time.
Lissyx is working on an AOT-compiled (optimized) model, so perhaps it could become possible by reducing the model!

You'd like to use a model limited to ~5,000 possible sentences, with the best accuracy possible:
Sure, with DeepSpeech, you're in the right place.

Assuming your sentences contain standard words, you create a vocab.txt containing all your ~5,000 sentences, then build an LM and a TRIE file from it.
You should obtain very good accuracy, quite a bit better than anything known in phoneme-based STT.
But I must say that, as of today, there is no perfect solution! Not yet…
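For reference, here is a minimal sketch of that pipeline, driving KenLM's lmplz / build_binary and the generate_trie tool from the DeepSpeech native client via Python. All file names are placeholders, and the exact generate_trie arguments and deepspeech CLI flags have changed between releases, so treat this as an outline and check the docs for your version:

```python
import subprocess

VOCAB = "vocab.txt"        # one sentence per line, all ~5,000 of them
ARPA = "lm.arpa"
LM_BINARY = "lm.binary"
TRIE = "trie"
ALPHABET = "alphabet.txt"  # DeepSpeech alphabet file, one character per line

# 1. Build an n-gram LM over the vocabulary with KenLM.
#    --discount_fallback helps lmplz cope with small corpora like this one.
subprocess.run(
    ["lmplz", "-o", "5", "--discount_fallback",
     "--text", VOCAB, "--arpa", ARPA],
    check=True,
)

# 2. Convert the ARPA file to KenLM's compact binary format.
subprocess.run(["build_binary", ARPA, LM_BINARY], check=True)

# 3. Build the trie for DeepSpeech's decoder. NOTE: this argument list
#    matches older releases; later ones dropped the vocab argument.
subprocess.run(["generate_trie", ALPHABET, LM_BINARY, VOCAB, TRIE], check=True)

# 4. Decode a test recording with the custom LM and trie
#    (0.x-era CLI flags; they differ in later releases).
subprocess.run(
    ["deepspeech", "--model", "output_graph.pbmm", "--alphabet", ALPHABET,
     "--lm", LM_BINARY, "--trie", TRIE, "--audio", "test.wav"],
    check=True,
)
```

Since the decoder is constrained to your fixed sentence set by the LM and trie, accuracy on in-vocabulary utterances should be very high.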

Sorry, but DeepSpeech is currently only an STT engine.

Hope this helps!

What I can add is that we have also begun our research on TTS.

@erogol At what stage is the TTS project? Are there any online resources, repos, forums, etc. about the Mozilla TTS?

One thing I have to wonder about this question is:

  • what is the relationship between the number of sentences modeled and the model size?

I wonder how difficult it would be to shard language models. Actually, that sounds like a promising approach.
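To make the idea concrete, here is a purely hypothetical sketch: split the sentence list into shards and build one small KenLM model per shard, so the application can load only the relevant shard's LM before decoding. Nothing in DeepSpeech supports this out of the box; the round-robin split and every name here are my own assumptions.

```python
import subprocess
from pathlib import Path

def build_sharded_lms(vocab_path: str, n_shards: int, out_dir: str) -> list:
    """Split a sentence list into shards and build one KenLM model per shard.

    Purely illustrative: DeepSpeech loads a single LM/trie pair, so the
    application would have to pick a shard before decoding.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    sentences = Path(vocab_path).read_text().splitlines()

    binaries = []
    for i in range(n_shards):
        # Naive round-robin split; a real system would shard by topic/domain.
        shard_txt = out / f"shard{i}.txt"
        shard_txt.write_text("\n".join(sentences[i::n_shards]) + "\n")

        arpa = out / f"shard{i}.arpa"
        binary = out / f"shard{i}.binary"
        subprocess.run(["lmplz", "-o", "3", "--discount_fallback",
                        "--text", str(shard_txt), "--arpa", str(arpa)],
                       check=True)
        subprocess.run(["build_binary", str(arpa), str(binary)], check=True)
        binaries.append(str(binary))
    return binaries
```

Each shard's LM (and trie) would be much smaller than one built over the whole corpus, which is what makes the approach look interesting for constrained devices like the RPi.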

Hi @jtane,
Well, model quality depends on the volume of training material.
And of course, it must be good-quality material.
The more material (WAV sentences / corresponding transcriptions), the better the model's inferences!
Of course, more material == a bigger model!


https://discourse.mozilla.org/c/tts
