Speech-to-text JSON result with time per word

Hi,

I’m currently using the SpeechMatics.com API to transcribe audio files into text, in the following JSON format:

[
{name: "word1", time: 130, …},
{name: "word2", time: 132, …},

]

But considering the cost per minute, I want to use my own engine. I tested DeepSpeech and I think that, with training, I will arrive at a good result. The only problem is that the output is raw text, so it is impossible for me to know when each word was pronounced.

Any idea how to reproduce the Speechmatics API result?

Thanks in advance, and sorry for my bad English.

Why don’t you use the library or one of its bindings and build it yourself? Besides, we currently have no way to produce a “time” that tells you when each word was spoken. There’s already a GitHub issue filed about that.

I think you can achieve something similar with:

  • VAD (voice activity detection)
  • our streaming API

As you can see, the libdeepspeech API just returns a string, but you can then process that yourself and produce JSON.
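A rough sketch of the VAD idea, in Python: if you cut the audio into speech chunks with a VAD and transcribe each chunk separately, you know the start and end time of every chunk, so you can spread that time over the words it contains. This is only an approximation, not real alignment. The function name `words_from_segments`, the `(start_ms, end_ms, text)` tuple layout, and the choice of milliseconds for `time` are all assumptions made for illustration; the `text` of each chunk would be whatever string DeepSpeech returns for it.

```python
import json


def words_from_segments(segments):
    """Estimate per-word start times from VAD chunks.

    segments: list of (start_ms, end_ms, text) tuples, one per speech chunk
    (hypothetical layout; text is the raw transcript of that chunk).
    Returns a list of {"name": word, "time": estimated_start_ms} dicts.
    """
    result = []
    for start_ms, end_ms, text in segments:
        words = text.split()
        if not words:
            continue  # chunk with no recognized words
        # Weight each word by its length plus one (for the following space),
        # on the crude assumption that longer words take longer to say.
        total = sum(len(w) + 1 for w in words)
        elapsed = 0
        for w in words:
            estimate = start_ms + (end_ms - start_ms) * elapsed // total
            result.append({"name": w, "time": estimate})
            elapsed += len(w) + 1
    return result


if __name__ == "__main__":
    # Made-up chunks, as a VAD plus one transcription per chunk might give.
    chunks = [(0, 1200, "hello world"), (1500, 2400, "how are you")]
    print(json.dumps(words_from_segments(chunks), indent=2))
```

The shorter the chunks the VAD produces, the smaller the error on each word, since the estimate can never fall outside the chunk it belongs to.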

So from the referenced GitHub issue, it looks like the new CTC decoder, once integrated into the project, could provide character timestamps.
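If character timestamps do become available, turning them into the per-word format from the first post should be simple. Here is a minimal sketch in Python, assuming the decoder hands back a list of `(character, time)` pairs in spoken order, with spaces included as characters. That input layout and the name `words_from_chars` are hypothetical, not an existing DeepSpeech API.

```python
def words_from_chars(chars):
    """Group (character, time) pairs into words.

    chars: list of (character, time) pairs in spoken order (assumed layout).
    Each word gets the timestamp of its first character.
    """
    words = []
    current = ""
    word_start = None
    for ch, t in chars:
        if ch == " ":
            # A space closes the word in progress, if there is one.
            if current:
                words.append({"name": current, "time": word_start})
            current, word_start = "", None
        else:
            if not current:
                word_start = t  # first character of a new word
            current += ch
    # Flush the last word, since the input may not end with a space.
    if current:
        words.append({"name": current, "time": word_start})
    return words


if __name__ == "__main__":
    # Made-up timestamps, in the same unit as the "time" field above.
    sample = [("h", 130), ("i", 131), (" ", 132), ("y", 134), ("o", 135)]
    print(words_from_chars(sample))
```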

see this comment