Phonemes Conversion

naveen · December 11, 2017, 9:15am

How can I use DeepSpeech to convert a WAV file to phonemes? I want to compare two people's speech audio files and report whether they are saying the same words, or where they differ. My idea is to convert the speech to phonemes and then compare those.

JanX2 · December 12, 2017, 3:48pm

As far as I can tell, this is not currently possible with Deep Speech. I would recommend having a look at Kaldi. Gentle uses it to do this under the hood.

kdavis · December 13, 2017, 2:34am

Why not just compare the text outputs?
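A minimal sketch of what comparing the text outputs could look like, assuming you already have one transcript string per speaker from whatever STT engine you use. The function name `word_diff` is made up for illustration; only Python's standard `difflib` is used.

```python
import difflib


def word_diff(transcript_a, transcript_b):
    """Return the word-level spans where two transcripts differ.

    Each entry is (tag, words_from_a, words_from_b), where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    # Lowercase and split on whitespace so casing does not count as a difference.
    a = transcript_a.lower().split()
    b = transcript_b.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, a[i1:i2], b[j1:j2]))
    return diffs
```

An empty list means both speakers produced the same words. Punctuation is not stripped here, so you may want to normalise the transcripts first.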

naveen · December 13, 2017, 5:22pm

I'll take a look at it. Thanks.

iveskins · December 15, 2017, 4:10pm

Gentle will just give you the timings of the sentences. AFAIK it just shows phonemes from a dictionary.
I have been using the FAVE project to extract vowel sounds from sentences.
I think most STT projects do a lot of prediction based on the sound signal; they don't really output exactly what people say.
But if you just want to check for whole-word differences, you can compare the text output of STT.
If you want to compare things at the phoneme level… it's a bit difficult, because phonemes are not really a real thing…
check out

and then

and the paper at the top
and this google summer of code project

…
You could get yourself a free Google Cloud Speech-to-Text API key and use autosub to make a timed text file of what each speaker said, then compare those files… or run those files through Gentle to get the word-by-word timings from Google's STT transcript.

JanX2 · December 15, 2017, 6:36pm

Have a look at the Gentle website’s source code as an example, @iveskins. The JSON that is generated contains the start time for each word and the duration of each phoneme.
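A rough sketch of pulling the per-word start times and per-phoneme durations out of that JSON. The field names (`words`, `case`, `word`, `start`, `phones`, `phone`, `duration`) and the `_B`/`_I`/`_E`/`_S` position suffix on phone labels are assumptions about Gentle's output; check them against a real response before relying on this. `phones_per_word` is a hypothetical helper name.

```python
import json


def phones_per_word(gentle_json):
    """Extract (word, start_time, [(phone, duration), ...]) from alignment JSON.

    ASSUMPTION: the JSON has a top-level "words" list whose successfully
    aligned entries carry "case" == "success", "word", "start" and "phones".
    """
    data = json.loads(gentle_json)
    result = []
    for entry in data.get("words", []):
        # Skip words that could not be aligned to the audio.
        if entry.get("case") != "success":
            continue
        phones = [
            # ASSUMPTION: labels look like "dh_B"; drop the position suffix.
            (p["phone"].split("_")[0], p["duration"])
            for p in entry.get("phones", [])
        ]
        result.append((entry["word"], entry["start"], phones))
    return result
```

Note that this only reads phonemes for words in the transcript you supplied, which is the limitation discussed in this thread.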

iveskins · December 15, 2017, 9:13pm

Oh sorry, yes, Gentle gives the timing of words and phonemes. I meant that it doesn't tell you what phonemes were actually said. It also outputs how certain it was.
But @naveen is asking about taking two audio files, converting them to phonemes, and then comparing the output. Gentle needs a transcript and an audio file.

JanX2 · December 15, 2017, 9:19pm

Yes. And it uses Kaldi behind the scenes to do exactly what the OP wants: convert the audio to phonemes. That it also does this for the text and diffs the two doesn’t help the OP, but the rest is pretty much a template for a solution.
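Once you have a phoneme sequence for each recording (from Kaldi or anything else), the comparison step itself could be sketched as a plain edit distance over the two sequences. This is only an illustration, not how Gentle does its diff; `phoneme_distance` and `phoneme_mismatch_rate` are hypothetical names, and the phone labels in the usage note are made up.

```python
def phoneme_distance(seq_a, seq_b):
    """Levenshtein distance between two phoneme sequences (lists of labels)."""
    # prev[j] holds the distance between the processed prefix of seq_a
    # and the first j phonemes of seq_b.
    prev = list(range(len(seq_b) + 1))
    for i, phone_a in enumerate(seq_a, 1):
        cur = [i]
        for j, phone_b in enumerate(seq_b, 1):
            cur.append(min(
                prev[j] + 1,                         # delete phone_a
                cur[j - 1] + 1,                      # insert phone_b
                prev[j - 1] + (phone_a != phone_b),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def phoneme_mismatch_rate(seq_a, seq_b):
    """Distance normalised by the longer sequence; 0.0 means identical."""
    longest = max(len(seq_a), len(seq_b))
    return phoneme_distance(seq_a, seq_b) / longest if longest else 0.0
```

For example, `phoneme_distance(["k", "ae", "t"], ["k", "ah", "t"])` counts a single substitution. Treating every substitution as equally costly is a simplification, since some phonemes sound much closer to each other than others.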
