Speech-to-text json result with time per word

noobski_21 · October 19, 2018, 7:44am

Hi,

I’m currently using the SpeechMatics.com API to transcribe audio files into text, in the following JSON format:

[
{"name": "word1", "time": 130, …},
{"name": "word2", "time": 132, …},
…
]

But considering the cost per minute, I want to use my own engine. I tested DeepSpeech and I think that, with training, I will get good results. The only problem is that the output is raw text, so I cannot tell when each word was spoken.

Any idea how to reproduce the SpeechMatics API result?

Thanks in advance, and sorry for my bad English.

lissyx · October 19, 2018, 7:47am

Why don’t you use the library or its bindings and build it yourself? Besides, we have no way to produce a “time” that tells you when each word was spoken. There’s already a GitHub issue filed about that.

lissyx · October 19, 2018, 8:41am

I think you can achieve something similar with:

VAD
our streaming API

As you can see, the libdeepspeech API just returns a string, but you can then process that string and produce JSON yourself.
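
A rough sketch of that idea, in Python: cut the audio into voiced segments with any VAD, transcribe each segment, and emit one JSON entry per word. Everything here is an assumption for illustration: `words_with_times` and the `transcribe` callable are hypothetical names (plug in whatever wraps your engine), times are taken to be in milliseconds, and word times inside a segment are only estimated by spreading the words evenly over the segment, so they are approximate at best.

```python
import json

def words_with_times(segments, transcribe):
    """Build SpeechMatics-like entries from VAD segments.

    segments:   iterable of (start_ms, end_ms, audio) produced by any VAD.
    transcribe: hypothetical callable taking one audio chunk, returning text.
    """
    result = []
    for start_ms, end_ms, audio in segments:
        words = transcribe(audio).split()
        if not words:
            continue  # silent or unrecognized segment
        # Assumption: words are spread evenly across the segment.
        step = (end_ms - start_ms) / len(words)
        for i, word in enumerate(words):
            result.append({"name": word, "time": int(start_ms + i * step)})
    return result

# Stand-in data so the sketch runs without any speech engine.
fake_segments = [(0, 1000, b"a"), (1500, 2100, b"b")]
fake_text = {b"a": "hello world", b"b": "again"}

print(json.dumps(words_with_times(fake_segments, fake_text.get)))
```

Shorter VAD segments give tighter estimates, since the only real timing information comes from the segment boundaries.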

yv001 · October 19, 2018, 1:49pm

So from the referenced GitHub issue, it looks like the new CTC, once integrated into the project, could provide character timestamps.

see this comment
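
If character timestamps do become available, turning them into per-word entries is mostly bookkeeping. A minimal sketch in Python, assuming the decoder hands back a list of (character, time) pairs with spaces marking word boundaries; that input shape and the name `chars_to_words` are assumptions for illustration, not an existing DeepSpeech API:

```python
def chars_to_words(chars):
    """Group (character, time) pairs into {"name", "time"} word entries.

    Each word takes the time of its first character.
    """
    words = []
    current, start = "", None
    for ch, t in chars:
        if ch == " ":
            # A space closes the current word, if there is one.
            if current:
                words.append({"name": current, "time": start})
            current, start = "", None
        else:
            if start is None:
                start = t  # first character of a new word
            current += ch
    if current:
        words.append({"name": current, "time": start})
    return words

example = [("h", 130), ("i", 131), (" ", 132), ("y", 140), ("o", 141)]
print(chars_to_words(example))
```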
