Force alignment (synchronize audio with text)

Hi, I am trying to build a Greek dataset for the DeepSpeech project. I checked LibriVox and other available datasets and I think I can get a big enough dataset. The problem is when I apply forced alignment with the Aeneas library (see this tutorial: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf).
When I export the JSON file I see that there is a big variance from the correct start/end times and I have to synchronize them manually. Is there any other tool that I could use to synchronize audio with text (available in many languages + Greek) and export JSON or a similar format?
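For reference, this is roughly how I run it: a minimal sketch using the aeneas Python task API (the paths are placeholders for my setup, and "ell" is the language code I pass for Greek):

```python
# Minimal sketch of an aeneas run for one chapter (paths are placeholders).
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# plain-text input, Greek ("ell"), JSON sync map as output
config_string = u"task_language=ell|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = u"/path/to/chapter.mp3"
task.text_file_path_absolute = u"/path/to/chapter.txt"
task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

# run the alignment and write the JSON sync map to disk
ExecuteTask(task).execute()
task.output_sync_map_file()
```

The resulting sync map is where I see the start/end times drifting from the correct values.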


@Tilman_Kamp Do you have any suggestions?

@ctzogka We are currently also working on forced alignment based on DeepSpeech, but it is pre-alpha and has not been tested on any language other than English. There is a quite comprehensive list of forced alignment tools. For your case the CMU Sphinx aligner seems to be a good first bet, as there is a Greek model. There is also an example of how to use it.


Thank you for the response, I am glad to hear that you are also working on forced alignment! The CMU Sphinx aligner isn’t what I was expecting… as far as I can see it exports a CSV with word timestamps + phonemic pronunciation. Also, I can’t run the project, and I think that’s because “This repository has been archived by the owner. It is now read-only.”
My opinion is that Aeneas is convenient for DeepSpeech, but it demands fine-tuning, which is time-consuming. Moreover, it would be nice to find or create an interface where, besides fine-tuning, users could edit the inferred alignment when it doesn’t match the ground truth.

Hey @Tilman_Kamp, I am also interested in a DeepSpeech-based forced alignment tool. Can you share a bit more info about that? What is the release date?

Here is the repo link for the tool @Tilman_Kamp mentioned: https://github.com/mozilla/DSAlign/tree/master

That’s the repo, yes… :slight_smile: The plan is to get it productive (for our purposes) in a couple of weeks. The main focus is labeling audio data on a phrase-by-phrase basis for training DeepSpeech.


Hey,

I am also interested in forced alignment to automatically generate karaoke-style timestamped lyrics (e.g. .lrc files) given an audio file and a lyrics file (*.mp3 and *.txt).
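In case it is useful, this is the kind of conversion I had in mind: a small sketch that turns a JSON sync map into an .lrc file. I’m assuming the aeneas-style structure where each entry in "fragments" carries a "begin" time in seconds and a "lines" list; other aligners will need different keys.

```python
# Sketch: convert a JSON sync map (aeneas-style "fragments") into an .lrc lyrics file.
import json

def lrc_timestamp(seconds):
    # .lrc timestamps use the [mm:ss.xx] format
    minutes = int(seconds // 60)
    return "[%02d:%05.2f]" % (minutes, seconds - minutes * 60)

with open("syncmap.json", encoding="utf-8") as f:
    syncmap = json.load(f)

with open("lyrics.lrc", "w", encoding="utf-8") as out:
    for fragment in syncmap["fragments"]:
        begin = float(fragment["begin"])            # start time of the fragment, in seconds
        text = " ".join(fragment["lines"]).strip()  # the lyric line(s) of this fragment
        out.write(lrc_timestamp(begin) + text + "\n")
```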

I am looking forward to seeing where this thread is heading!

Hi,

What is the state of forced alignment with DeepSpeech? The repo looks quite good already, but could you confirm what its status is?

Hi @Tilman_Kamp, I’ve been using DSAlign, and thank you so much for the work you’ve put in, it’s fantastic.

I have a question about the workflow for generating new training data for DeepSpeech using DSAlign. We can do it using EXPORT, but is there a way to avoid generating wav files etc. and just use the new transcriptions to train DeepSpeech once we’ve generated them in DSAlign? Similar to how DSAlign calls DeepSpeech for inference, could it also call DeepSpeech for training? It would make the workflow a lot smoother to have that sort of integration.
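For context, this is the glue I use today after alignment: a rough sketch (not a DSAlign feature, just my own script) that slices the source wav with pydub and writes the CSV schema the DeepSpeech trainer expects. I’m assuming the aligned JSON entries carry "start"/"end" in milliseconds and the matched text under "aligned", so adjust the keys if yours differ.

```python
# Rough sketch: turn DSAlign-style output into audio clips + a DeepSpeech training CSV.
# Assumes a list of fragments with "start"/"end" in milliseconds and an "aligned" text field.
import csv
import json
import os

from pydub import AudioSegment

audio = AudioSegment.from_wav("recording.wav")

with open("recording.aligned", encoding="utf-8") as f:
    fragments = json.load(f)

os.makedirs("clips", exist_ok=True)

with open("train.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for i, fragment in enumerate(fragments):
        clip = audio[fragment["start"]:fragment["end"]]  # pydub slices by milliseconds
        clip_path = os.path.join("clips", "clip-%06d.wav" % i)
        clip.export(clip_path, format="wav")
        writer.writerow([clip_path, os.path.getsize(clip_path), fragment["aligned"]])
```

It works, but that clip-generation step is exactly what I’d love to skip with tighter integration.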