Handling long fragments / sentences in DSAlign

Tortoise · February 6, 2020, 12:29pm

I would like to ask how to handle long sentences.
Sometimes, depending on the speaker, a long sentence spoken in one go is aligned as a single fragment, for example 10 seconds and 50 words or so, and all of it ends up in one transcript. It would be helpful if such fragments could be broken into smaller parts. Could you guide me on how to resolve this? Is there any way to control the number of words per aligned transcript, or its duration?

/bin/align.sh --output-max-cer 15 --loglevel 10 --audio data/audio.wav --script data/transcript.txt --aligned data/result.json --tlog data/result.log --output-pretty --stt-max-duration 2000

VAD splitting: 0it [00:00, ?it/s]
INFO:root:Fragment 0: Audio too long for STT
INFO:root:Fragment 1: Audio too long for STT

The text of fragments above this limit is missing from the result. How can I handle this and split such fragments into the smallest possible parts?

Tilman_Kamp · February 6, 2020, 1:10pm

So far the only parameter for steering the splitting behavior is --audio-vad-aggressiveness, and it already defaults to the highest value. Another possibility is shortening the VAD time-window. Fun fact: I am currently working on a light refactoring of this code and will integrate that now (as there seems to be a need for it).
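To illustrate why a shorter time-window leads to more splits, here is a minimal sketch in plain Python. It is not DSAlign's actual splitting code: the function name `split_on_silence` and its parameters are hypothetical, and the per-frame voiced/unvoiced flags are assumed to come from a VAD such as webrtcvad (which accepts 10, 20 or 30 ms frames). A fragment is closed once a run of unvoiced frames reaches the window length, so a smaller window cuts on shorter pauses.

```python
from typing import List, Tuple


def split_on_silence(
    voiced: List[bool], frame_ms: int, min_silence_frames: int
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) speech spans.

    Hypothetical helper: a span is closed after `min_silence_frames`
    consecutive unvoiced frames. This is a simplification, not the
    logic DSAlign itself uses.
    """
    fragments = []
    start = None        # index of the first voiced frame of the open span
    last_voiced = None  # index of the most recent voiced frame
    silence = 0         # length of the current run of unvoiced frames
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            last_voiced = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                fragments.append((start * frame_ms, (last_voiced + 1) * frame_ms))
                start = None
                silence = 0
    if start is not None:
        # Audio ended while a span was still open.
        fragments.append((start * frame_ms, (last_voiced + 1) * frame_ms))
    return fragments


# Five voiced frames, a two-frame pause, then three voiced frames (30 ms each).
flags = [True] * 5 + [False] * 2 + [True] * 3
print(split_on_silence(flags, frame_ms=30, min_silence_frames=3))
print(split_on_silence(flags, frame_ms=30, min_silence_frames=2))
```

With a window of three frames the two-frame pause is ignored and everything stays in one fragment; with a window of two frames the same audio is split at the pause.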

Tortoise · February 6, 2020, 1:19pm

Dear Sir, it's a great help. Thank you!!

Tortoise · February 6, 2020, 2:40pm

@Tilman_Kamp Dear Sir,

I have tested the code and it works fairly well at recognizing the speaker / vocals. The algorithm is weighting correctly (depending on the quality of the trained model). Is it possible to get those numbers, i.e. the weights, or something like a speaker number, if it can assign the speaker?

Tortoise · February 11, 2020, 3:45pm

Dear Sir,

Thank you so much. You have been a great help. I have two questions.

DSAlign is working great and is much improved, but the sentence in each transcript is still too long. Can it be controlled a little more in some other way, for example by limiting each utterance or transcript to 5 words or so?
The 10 ms minimum frame duration is a limitation of the webrtcvad library, but is there any other way to split the transcript further?

Please guide.
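
One possible workaround for the word-count question, independent of the VAD, is to post-process the alignment result. The sketch below is not a DSAlign feature: the helper `split_fragment` is hypothetical, and it assumes each entry of the result JSON carries `start` and `end` in milliseconds plus the aligned text under an `aligned` key. Sub-fragment boundaries are only estimated by distributing the duration in proportion to character counts, so they can drift from the real word boundaries and should be checked before use.

```python
from typing import Dict, List


def split_fragment(
    fragment: Dict, max_words: int = 5, text_key: str = "aligned"
) -> List[Dict]:
    """Split one aligned fragment into pieces of at most `max_words` words.

    Assumption: `fragment` has "start"/"end" in milliseconds and its text
    under `text_key`. Timestamps of the pieces are interpolated linearly
    by character count, which is an approximation only.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = fragment[text_key].split()
    if len(words) <= max_words:
        return [dict(fragment)]
    total_chars = sum(len(w) for w in words)
    duration = fragment["end"] - fragment["start"]
    pieces = []
    chars_done = 0
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        piece_start = fragment["start"] + duration * chars_done / total_chars
        chars_done += sum(len(w) for w in chunk)
        piece_end = fragment["start"] + duration * chars_done / total_chars
        pieces.append(
            {
                "start": round(piece_start),
                "end": round(piece_end),
                text_key: " ".join(chunk),
            }
        )
    return pieces


# Made-up fragment for illustration only.
example = {"start": 0, "end": 1000, "aligned": "aa bb cc dd ee ff gg hh ii jj"}
for piece in split_fragment(example, max_words=5):
    print(piece)
```

Because the timing is interpolated rather than measured, this is better suited to display or rough segmentation than to producing training data without further verification.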