Confidence of STT transcription

I am using DeepSpeech 0.6.1, which so far gives me the best transcriptions.
I tried 0.8.2, but perhaps because the scorer-building process is different, I have not been able to build a correct LM and scorer, so I could not get comparable results.

My question is: is it possible to get a confidence value for the STT transcription during transcription and alignment? My DeepSpeech is 0.6.1 and my DSAlign is also version 0.6.1.

I mean a confidence value for each fragment, for DeepSpeech 0.6.1 with DSAlign.
Please guide me.

As far as I know, the current DSAlign can only be run with the newer models (>= 0.7), but it is a bit old, so you could use an older commit; check the repo. Since Tilman is no longer on the project, though, you won’t have any support.

You are right. My question is just about the confidence. I have the old version as well; the difference between them is mostly logistics.
Can the newer version provide confidence?

Don’t know, you’ll have to try it yourself.


No, we can’t provide features from a newer version on an older one.

Help yourself, read the documentation of the API: https://deepspeech.readthedocs.io/en/v0.9.1/Structs.html#_CPPv4N19CandidateTranscript10confidenceE

We are happy to try and support people to the extent of our availability, but please help us by checking the documentation thoroughly first.

Basically, searching for “confidence” would have been a good start: https://deepspeech.readthedocs.io/en/v0.9.1/search.html?q=confidence&check_keywords=yes&area=default

Dear friends, this I know. Thank you again.
OK, in simple words:

How do I get this confidence value while transcribing? There is no flag for it that I can see among the deepspeech flags. If you can guide me: I want start time, end time, and transcript, together with the confidence for that fragment.

As said multiple times and shown in the linked docs, this is exposed in the metadata of the API.
Some CLI clients provide --json, but you should rather use the bindings for your language, if they exist, instead of relying on executing a third-party binary like that. But again, since you don’t share details on what you are using, we can’t help efficiently.
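For what it’s worth, here is a rough sketch of reading the confidence through the Python bindings (DeepSpeech >= 0.7); the model, scorer, and wav paths below are placeholders you would replace with your own. Note that, per the API docs, `confidence` is roughly a sum of acoustic-model logit values, not a normalized 0–1 probability.

```python
def best_transcript_and_confidence(metadata):
    """Return (text, confidence) for the highest-confidence candidate.

    `metadata` is the object returned by Model.sttWithMetadata(); each
    CandidateTranscript carries a `confidence` float and a list of
    per-character `tokens`.
    """
    best = max(metadata.transcripts, key=lambda t: t.confidence)
    text = "".join(token.text for token in best.tokens)
    return text, best.confidence

if __name__ == "__main__":
    import wave
    import numpy as np
    import deepspeech  # pip install deepspeech

    # Placeholder file names, not real paths from this thread.
    model = deepspeech.Model("deepspeech-0.9.1-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.1-models.scorer")
    with wave.open("audio.wav", "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    metadata = model.sttWithMetadata(audio, 1)
    text, conf = best_transcript_and_confidence(metadata)
    print(text, conf)
```

The helper itself only walks the returned metadata, so it works the same whether you request one candidate transcript or several.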

I am transcribing the wav file as below, with DSAlign and DeepSpeech 0.7.1:

    ./bin/align.sh --audio /home/data/audio.wav --script /home/data/raw_empty.txt --aligned /home/data/text_aligned.json --tlog /home/data/text_log.json --output-pretty

where I get the output below:

    {
        "duration": 5.06,
        "start": 0.0,
        "end": 5.06,
        "transcript": "I am a boy"
    }

But I want like this:

    {
        "duration": 5.06,
        "start": 0.0,
        "end": 5.06,
        "confidence": 0.95,
        "transcript": "I am a boy"
    }

Each transcription should have its own confidence value, from 0.00 to 1.00 (or none), depending on the quality of the audio fragment. Maybe now you get my point.
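If it helps, a fragment in that shape could be built with a small helper like the one below, assuming you patch DSAlign where it writes its aligned/tlog entries and already have a confidence value from the STT metadata. The function name and the exact patch location are hypothetical; only the field names mirror the JSON above.

```python
def fragment_with_confidence(start, end, transcript, confidence):
    """Build one aligned-fragment entry in the shape shown above,
    with an extra 'confidence' field taken from the STT metadata.

    Times are in seconds; values are rounded to two decimals to
    match the example output.
    """
    return {
        "duration": round(end - start, 2),
        "start": start,
        "end": end,
        "confidence": round(confidence, 2),
        "transcript": transcript,
    }
```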

Where is the code that produces it?

It is under the link

https://github.com/mozilla/DSAlign

Then you need to change DSAlign.

Thank you, I got it. I tried it and was successful with your guidance.
The confidence I am getting is one value per wav file, at the end of the transcript. Is it possible to get it for each word? Right now I get one value per audio file; I want to check how accurate/confident each individual word is.

Please explore the documentation of the Metadata structure.
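One caveat from the Metadata docs: DeepSpeech reports `confidence` per CandidateTranscript, not per word. The tokens are per character, each with `text` and `start_time`, so per-word confidence is not directly available, but you can at least recover word boundaries and start times. A sketch, assuming token objects shaped like DeepSpeech’s TokenMetadata:

```python
def words_from_tokens(tokens):
    """Group per-character TokenMetadata entries into words.

    Each token is expected to have `.text` (a single character) and
    `.start_time` (seconds). Only word text and start times can be
    recovered this way; DeepSpeech does not expose per-word confidence.
    """
    words, current, start = [], [], None
    for token in tokens:
        if token.text == " ":
            if current:
                words.append({"word": "".join(current), "start": start})
                current, start = [], None
        else:
            if not current:
                start = token.start_time
            current.append(token.text)
    if current:
        words.append({"word": "".join(current), "start": start})
    return words
```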
