beomgon.yu
(Beomgon Yu)
January 28, 2020, 7:11am
DeepSpeech Model
================
The aim of this project is to create a simple, open, and ubiquitous speech
recognition engine. Simple, in that the engine should not require server-class
hardware to execute. Open, in that the code and models are released under the
Mozilla Public License. Ubiquitous, in that the engine should run on many
platforms and have bindings to many different languages.
The architecture of the engine was originally motivated by that presented in
`Deep Speech: Scaling up end-to-end speech recognition <http://arxiv.org/abs/1412.5567>`_.
However, the engine currently differs in many respects from the engine it was
originally motivated by. The core of the engine is a recurrent neural network (RNN)
trained to ingest speech spectrograms and generate English text transcriptions.
Let a single utterance :math:`x` and label :math:`y` be sampled from a training set:

.. math::
    S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}.
In the above figure, the input sequence length differs for each audio file.
Is the input sequence size fixed? If so, where is the code that fixes it (chunking, etc.)?
If not, is the input sequence length of the graph dynamic?
I am confused about this.
Thanks.
The RNN looks at features from 20ms time-slices of the audio input, if that answers your question. I’m not sure where to find the code for that, but it’s somewhere in the git repo.
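To illustrate what "20ms time-slices" means in practice, here is a minimal numpy sketch that chops a raw audio signal into fixed-size slices. The function name and the non-overlapping framing are illustrative assumptions; the actual feature code in the DeepSpeech repo uses its own windowing and computes spectral features on top of each slice.

```python
import numpy as np

def frame_audio(samples, sample_rate=16000, window_ms=20):
    """Split a 1-D audio signal into non-overlapping fixed-size time slices.

    window_ms=20 matches the 20ms slices mentioned above; this is only a
    sketch, not the repo's actual feature-extraction code.
    """
    window = int(sample_rate * window_ms / 1000)  # samples per 20ms slice
    n_frames = len(samples) // window             # drop the trailing remainder
    return samples[: n_frames * window].reshape(n_frames, window)

# One second of 16 kHz audio yields 50 non-overlapping 20ms slices of 320 samples.
audio = np.zeros(16000)
frames = frame_audio(audio)
```

Note that the number of slices depends on the audio duration, which is why different files produce different sequence lengths.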
reuben
February 3, 2020, 7:53pm
And the sequence length is dynamic. This is why the graph takes sequence lengths as input.
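A minimal numpy sketch of what "takes sequence lengths as input" means: utterances of different lengths are padded to a common length for batching, and the true per-utterance lengths are passed alongside so the network can ignore the padding. The batch shapes and variable names here are illustrative assumptions, not the actual DeepSpeech feeding code.

```python
import numpy as np

# Hypothetical batch of per-utterance feature sequences (frames x features),
# each with a different number of frames.
batch = [np.ones((30, 26)), np.ones((50, 26)), np.ones((42, 26))]

seq_lengths = np.array([feats.shape[0] for feats in batch])  # true lengths
max_len = seq_lengths.max()

# Zero-pad every utterance to the longest one in the batch.
padded = np.zeros((len(batch), max_len, 26))
for i, feats in enumerate(batch):
    padded[i, : feats.shape[0]] = feats
```

The padded tensor has a fixed shape per batch, but `seq_lengths` lets the graph treat each utterance at its real, dynamic length.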