Forced alignment (synchronize audio with text)

Hi, I am trying to build a Greek dataset for the DeepSpeech project. I checked LibriVox and other available datasets, and I think I can get a big enough dataset. The problem appears when I apply forced alignment with the Aeneas library (see this tutorial: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf).
When I export the JSON file, I see that the start/end times deviate a lot from the correct ones, and I have to synchronize them manually. Is there any other tool that I could use to synchronize audio with text (available in many languages, including Greek) and export a JSON file or a similar format?

@Tilman_Kamp Do you have any suggestions?

@ctzogka We are currently also working on forced alignment based on DeepSpeech, but this is pre-alpha and has not been tested on any language other than English. There is a quite comprehensive list of forced alignment tools. For your case, the CMU Sphinx aligner seems to be a good first bet, as there is a Greek model. There is also an example of how to use it.

Thank you for the response, I am glad to hear that you are also working on forced alignment! The CMU Sphinx aligner isn’t what I was expecting: as far as I can see, it exports a CSV with word timestamps plus phonemic pronunciation. I also can’t run the project, and I think that’s because “This repository has been archived by the owner. It is now read-only.”
My opinion is that Aeneas is convenient for DeepSpeech, but it requires fine-tuning, which is time-consuming. Moreover, it would be nice to find or create an interface where, besides fine-tuning, users could edit the inferred output when it does not match the ground truth.
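
To cut down on the manual checking, one option is to flag only the fragments that look implausible and review those by hand. Below is a minimal Python sketch, assuming the exported sync map has the usual Aeneas JSON layout (a top-level "fragments" list whose items carry "id", "begin", "end" and "lines"). The function name `flag_suspicious_fragments` and the characters-per-second thresholds are hypothetical and would need tuning for Greek speech; the sample data is made up for illustration.

```python
import json


def flag_suspicious_fragments(sync_map, min_cps=4.0, max_cps=25.0):
    """Return ids of fragments whose speaking rate looks implausible.

    The rate is measured in characters per second. The default
    thresholds are illustrative guesses, not calibrated values.
    """
    flagged = []
    for frag in sync_map["fragments"]:
        begin = float(frag["begin"])  # times are stored as strings
        end = float(frag["end"])
        text = " ".join(frag["lines"])
        duration = end - begin
        if duration <= 0:
            # Zero or negative length is always worth a manual look.
            flagged.append(frag["id"])
            continue
        cps = len(text) / duration
        if cps < min_cps or cps > max_cps:
            flagged.append(frag["id"])
    return flagged


# Made-up sync map used only to demonstrate the check.
example = json.loads("""
{
  "fragments": [
    {"id": "f000001", "begin": "0.000", "end": "2.000",
     "lines": ["Good morning"]},
    {"id": "f000002", "begin": "2.000", "end": "2.100",
     "lines": ["How are you today?"]},
    {"id": "f000003", "begin": "2.100", "end": "12.100",
     "lines": ["Yes"]}
  ]
}
""")

print(flag_suspicious_fragments(example))
```

This only narrows down which fragments to inspect; it does not correct the timestamps themselves.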

Hey @Tilman_Kamp, I am also interested in a DeepSpeech-based forced alignment tool. Can you share a bit more info about it? Is there a planned release date?

Here is the repo link for the tool mentioned by @Tilman_Kamp: https://github.com/mozilla/DSAlign/tree/master

That’s the repo, yes… :slight_smile: The plan is to get it production-ready (for our purposes) in a couple of weeks. The main focus is labeling audio data on a phrase-by-phrase basis for training DeepSpeech.

Hey,

I am also interested in forced alignment, to automatically generate timestamped karaoke lyrics (e.g. .lrc files) from an audio file and a lyrics file (.mp3 and .txt).

I am looking forward to seeing where this thread is heading!
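
For the .lrc use case, here is a rough Python sketch of the conversion step, assuming an aligner has already produced fragments with a "begin" time in seconds and a "lines" list (the layout Aeneas uses in its JSON export). The helper names `to_lrc_timestamp` and `fragments_to_lrc` are hypothetical, the sample fragments are made up, and only the simple line-level `[mm:ss.xx]` LRC tag is covered.

```python
def to_lrc_timestamp(seconds):
    """Format a time in seconds as a simple LRC tag, [mm:ss.xx]."""
    total_cs = int(round(seconds * 100))   # work in centiseconds
    minutes, rest = divmod(total_cs, 6000)
    secs, cs = divmod(rest, 100)
    return f"[{minutes:02d}:{secs:02d}.{cs:02d}]"


def fragments_to_lrc(fragments):
    """Turn aligned fragments into LRC text, one lyric line per fragment."""
    out = []
    for frag in fragments:
        text = " ".join(frag["lines"]).strip()
        if text:  # skip empty fragments such as silence
            out.append(to_lrc_timestamp(float(frag["begin"])) + text)
    return "\n".join(out)


# Made-up fragments used only to demonstrate the conversion.
fragments = [
    {"begin": "0.000", "end": "4.100", "lines": ["First line"]},
    {"begin": "62.340", "end": "66.000", "lines": ["Second line"]},
]

print(fragments_to_lrc(fragments))
```

The result can be written to a .lrc file next to the .mp3; the quality of the lyrics display still depends entirely on how accurate the alignment is.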