DSAlign - handling disfluencies

@tilman_kamp With the forced alignment tool, are there any particular approaches to take with regard to disfluencies (especially fillers like um, ah, etc.)?

I’m interested in using it for a source that has sections of audio with fairly frequent cases of them. I suspect it would be labour-intensive to identify them and adjust the transcripts (which don’t reference them currently; it’s just regular text). I realise the best test is to try it, but do you have any feeling for how resilient it is to that kind of thing in general?

I am currently aligning a dataset like this. The audio is conversational, while the original transcript ignores the disfluencies. Some observations:

  • As long as a (short) mismatch occurs in the middle of an utterance, the algorithm is pretty resilient and just ignores it.
  • Mismatches at the beginnings and ends of utterances (unfortunately the typical places for disfluencies) can in extreme cases cause the aligner to also consume characters from the previous or next utterance. In that case you have effectively “lost” two utterances.
  • Getting the best results is a tough parameter-tweaking game. --align-similarity-algo has the most influence on the behavior.

The alignment behavior “between utterances” can be further fine-tuned through the following parameters:

  • --align-shrink-fraction
  • --align-stretch-fraction
  • --align-word-snap-factor
  • --align-phrase-snap-factor
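To make the tweaking loop concrete, here is a rough sketch of what an alignment run with these parameters might look like. The file paths and parameter values are purely illustrative assumptions, not recommendations; check DSAlign’s own documentation for the full set of flags and their defaults before relying on any of them:

```shell
# Hypothetical DSAlign invocation (paths and values are made up for illustration).
# The idea: fix audio/script/output, then iterate on the --align-* knobs,
# starting with --align-similarity-algo, and compare the resulting alignments.
bin/align.sh \
    --audio data/episode01/audio.wav \
    --script data/episode01/transcript.txt \
    --aligned data/episode01/aligned.json \
    --align-similarity-algo wng \
    --align-shrink-fraction 0.1 \
    --align-stretch-fraction 0.25 \
    --align-word-snap-factor 1.5 \
    --align-phrase-snap-factor 1.0
```

In practice one would keep everything fixed except one parameter per run and spot-check utterances near disfluency-heavy passages, since (as noted above) the boundaries between utterances are where things go wrong first.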

Many thanks @Tilman_Kamp - that’s very helpful to know and has convinced me to give it a go soon. Thanks again for open-sourcing a great project like this one.