TTS for film subtitles

shad94 · November 17, 2019, 2:55pm

Hi,
I am currently working on a Polish version of TTS, and my final goal is a Polish-speaking lector for films.
Of course, I can use a simple program like Sony Vegas Studio to merge the film with my .wav file, but my question is: how do I generate a .wav file in which each spoken line fits exactly into its time interval?
Example:
One person is speaking from (mm:ss:msms) 00:00:02 to 00:08:01, and the next person is speaking from 00:11:01 to 00:19:22.
My lector is a single voice, with no division into male/female voices etc.

Do you have any advice on how to do it?

Regards

nmstoker · November 18, 2019, 12:16am

I think this is slightly outside the scope of the project, but I was curious about your goal, so I took a look for potential solutions.

The term “lector” as you use it wasn’t familiar to me, but am I right in understanding that you want TTS to read out film subtitles, and that each line should be spoken within its associated time slot so that it lines up with the video correctly?

There could be other options from googling, but I found the repo below, which allows adjusting audio duration without affecting the pitch (handy if you don’t want it to sound like chipmunks!). There’s a demo in one of the notebooks under the examples folder that looks like it might do the trick.

I’m guessing it might get more complicated in reality, but roughly you’d do the following for each sentence: run it through TTS, measure the duration of the generated audio, stretch or compress it by the ratio between that duration and the time slot you’ve got from the video, and then append the result to your output soundtrack (with silence filling the gaps between slots).
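
To make that a bit more concrete, here is a rough sketch of the bookkeeping in Python. It is only an illustration under several assumptions: the names `stretch_rate` and `build_track` are made up for this post, audio clips are plain lists of float samples, slot times are given in seconds, and `stretch` is a placeholder for whatever pitch-preserving time-stretch routine you end up using (it is not an API of any particular library).

```python
from typing import Callable, List, Sequence, Tuple

Samples = List[float]


def stretch_rate(tts_seconds: float, slot_seconds: float) -> float:
    """Speed factor needed to fit the TTS audio into the slot.

    A value above 1 means the audio must be sped up, below 1 slowed down.
    """
    if tts_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    return tts_seconds / slot_seconds


def build_track(
    clips: Sequence[Samples],
    slots: Sequence[Tuple[float, float]],
    sample_rate: int,
    stretch: Callable[[Samples, float], Samples],
) -> Samples:
    """Place each TTS clip into its (start, end) slot, padding gaps with silence.

    `stretch(samples, rate)` is a hypothetical callback: it should return the
    clip played `rate` times faster without changing the pitch.
    """
    track: Samples = []
    for clip, (start, end) in zip(clips, slots):
        start_idx = round(start * sample_rate)
        slot_len = round(end * sample_rate) - start_idx
        if start_idx < len(track):
            raise ValueError("slots overlap or are out of order")

        # Silence from the end of the previous clip up to this slot's start.
        track.extend([0.0] * (start_idx - len(track)))

        rate = stretch_rate(len(clip) / sample_rate, slot_len / sample_rate)
        fitted = list(stretch(clip, rate))

        # Stretchers are rarely sample-exact, so trim or pad to the slot length.
        fitted = fitted[:slot_len]
        fitted.extend([0.0] * (slot_len - len(fitted)))
        track.extend(fitted)
    return track
```

If the timestamps in the question are read as roughly 0.02 s to 8.01 s and 11.01 s to 19.22 s (my interpretation, it may be wrong), you would pass `slots=[(0.02, 8.01), (11.01, 19.22)]` and then write the returned samples to a .wav file. In practice you would probably also want to cap the rate, since speech stretched or compressed by a large factor tends to sound unnatural.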