Phonemes Conversion


(Naveen) #1

How can I use DeepSpeech to convert a WAV file to phonemes? I want to compare two people's speech audio files and output whether they are saying the same words, or where they differ. My idea is to convert the speech to phonemes and then compare those.


(Jan) #2

As far as I can tell, this is not currently possible with Deep Speech. I would recommend having a look at Kaldi. Gentle uses it to do this under the hood.


(kdavis) #3

Why not just compare the text outputs?
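A minimal sketch of what comparing the text outputs could look like, assuming you already have one transcript string per speaker from whatever speech-to-text engine you use. The function name `word_diff` and the sample sentences are made up for illustration; only Python's standard `difflib` is used.

```python
import difflib


def word_diff(transcript_a, transcript_b):
    """Return the spans where two transcripts differ, word by word.

    Each item is (tag, words_from_a, words_from_b), where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    # Lowercase and split on whitespace; punctuation handling is left out.
    a = transcript_a.lower().split()
    b = transcript_b.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical transcripts standing in for real STT output.
diffs = word_diff("the quick brown fox", "the quick red fox jumps")
for tag, words_a, words_b in diffs:
    print(tag, words_a, words_b)
# replace ['brown'] ['red']
# insert [] ['jumps']
```

An empty result means both speakers produced the same word sequence, at least as far as the recognizer could tell.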


(Naveen) #4

I'll take a look at it. Thanks.


(Iveskins) #5

Gentle will just give you the timings of the sentences. AFAIK it just shows phonemes from a dictionary.
I have been using the FAVE project to extract vowel sounds from sentences.
I think most STT projects make a lot of predictions based on the sound signal; they don't really output exactly what people say.
But if you just want to check for whole-word differences, you can compare the text output of STT.
If you want to compare things at a phoneme level, it's a bit difficult, because phonemes are not really a real thing…
check out


and then

and the paper at the top
and this Google Summer of Code project


You could get yourself a free Google Cloud Speech-to-Text API key and use autosub

to make a timed text file of what they said, then compare those files, or run those files through Gentle to get the word-by-word timings from Google's STT transcript.


(Jan) #6

Have a look at the Gentle website’s source code as an example, @iveskins. The JSON that is generated contains the start time for each word and the duration of each phoneme.
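As a rough illustration of reading that JSON, here is a sketch that walks an alignment and prints an approximate start time for each phoneme. The layout is an assumption based on typical Gentle output (a top-level `words` list whose entries carry `word`, `start` and a `phones` list of `phone`/`duration` pairs); the sample data and the `phoneme_timings` helper are invented for the example, so check the field names against your own output.

```python
import json

# Hand-written sample in the assumed layout; not real aligner output.
SAMPLE = """
{"words": [
  {"word": "hello", "case": "success", "start": 0.5, "end": 0.9,
   "phones": [{"phone": "hh_B", "duration": 0.1},
              {"phone": "ah_I", "duration": 0.1},
              {"phone": "l_I", "duration": 0.1},
              {"phone": "ow_E", "duration": 0.1}]},
  {"word": "world", "case": "not-found-in-audio"}
]}
"""


def phoneme_timings(alignment):
    """Return (word, phone, start_seconds, duration) for each aligned phoneme."""
    rows = []
    for entry in alignment.get("words", []):
        # Words that could not be aligned carry no timing information.
        if "start" not in entry or "phones" not in entry:
            continue
        t = entry["start"]
        for p in entry["phones"]:
            rows.append((entry["word"], p["phone"], round(t, 3), p["duration"]))
            # Each phoneme starts where the previous one ended.
            t += p["duration"]
    return rows


rows = phoneme_timings(json.loads(SAMPLE))
for row in rows:
    print(row)
```

The phoneme start times are derived by accumulating durations from the word's start time, which assumes the phonemes are listed in order with no gaps between them.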


(Iveskins) #7

Oh sorry, yes, Gentle gives the timing of words and phonemes. I meant that it doesn't tell you which phonemes were actually said. It also outputs how certain it was.
But @naveen is asking about taking two audio files converting them to phonemes and then comparing the output. Gentle needs a transcript and an audio file.


(Jan) #8

Yes. And it uses Kaldi behind the scenes to do exactly what the OP wants: convert the audio to phonemes. That it also does this for the text and diffs the two doesn’t help the OP, but the rest is pretty much a template for a solution.
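Once each recording has been turned into a phoneme sequence (by Kaldi or anything else), the comparison step itself is small. A sketch, assuming the two sequences are plain Python lists of phoneme labels: the ARPAbet-style labels and the names `edit_distance` and `phoneme_error_rate` are illustrative only, and nothing here calls Kaldi or Gentle.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme labels."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference sequence."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical phoneme sequences for the same word from two speakers.
speaker_a = ["HH", "AH", "L", "OW"]
speaker_b = ["HH", "EH", "L", "OW"]
print(edit_distance(speaker_a, speaker_b))       # 1
print(phoneme_error_rate(speaker_a, speaker_b))  # 0.25
```

A rate of 0 means the sequences are identical. Where to put the threshold for "same words" depends on how noisy the recognizer is, so that part is left open.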