How do I sync translated speech with the original video?

I’m quite new to all of this. I’ve taken on a project and am a bit out of my depth, so I’m trying to work out a viable tooling architecture at a high level before getting my hands dirty.

The project is a full speech-to-speech (s2s) translation of a researcher’s body of work. The source language is English, and I need to end up with s2s translations in Spanish, French, German, Russian and Arabic.

Questions:

  • Are there pre-trained models for these languages available with Mozilla TTS? If not, can anyone suggest an alternative TTS that has pre-trained models for all of them?

  • How do I go about syncing the translated audio to the video? It doesn’t have to be perfect; dubbed voice-overs are typically a bit off anyway. I’d guess the cheap way is to take the timestamps from the ASR output and feed them into the TTS step, but I don’t know if there is a simple standard approach.
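
To make the timestamp idea in the last question concrete, here is a minimal sketch in Python. It assumes the ASR step gives a start and end time per segment and that the length of each synthesized clip is already known. The names `Segment`, `plan_alignment` and `max_speedup` are hypothetical and do not come from any particular ASR or TTS library. It only computes where each clip should start and how much it would need to be sped up to fit its slot.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float         # ASR segment start in the original video, seconds
    end: float           # ASR segment end, seconds
    tts_duration: float  # length of the synthesized translated clip, seconds


def plan_alignment(segments, max_speedup=1.3):
    """Return a list of (start_time, tempo) pairs, one per segment.

    tempo == 1.0 means the clip is played unchanged; tempo > 1.0 means it
    must be played that many times faster to fit the ASR time slot.
    max_speedup is an assumed cap so speech does not become unnaturally fast.
    """
    plan = []
    cursor = 0.0  # time at which the previous clip finishes playing
    for seg in segments:
        # Never start before the previous clip has finished.
        start = max(seg.start, cursor)
        available = max(seg.end - start, 0.0)
        if seg.tts_duration <= available:
            tempo = 1.0
        elif available > 0.0:
            # Speed up just enough to fit, but no more than the cap.
            tempo = min(seg.tts_duration / available, max_speedup)
        else:
            tempo = max_speedup
        cursor = start + seg.tts_duration / tempo
        plan.append((start, tempo))
    return plan


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.0, 1.5),  # fits as-is
        Segment(2.5, 4.5, 3.0),  # too long, gets sped up to the cap
        Segment(5.0, 6.0, 0.8),  # fits as-is
    ]
    for start, tempo in plan_alignment(demo):
        print(f"start={start:.2f}s tempo={tempo:.2f}")
```

The resulting start times and tempo factors could then be handed to an audio tool to do the actual placement and stretching, for example ffmpeg's `adelay` and `atempo` filters. Clips that still overflow at the cap simply push the next clip later, which matches the "doesn't have to be perfect" requirement.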