Large Romanian Dataset Formatting Help

Hi,

So one of my friends was nice enough to pass on the Romanian subtitles that they had written for a bunch of movies over several years. The person gave me written approval to publish the sentences in the public domain; however, the format of these files is problematic. I have about 9 MB of text files, and it would greatly help if you guys could help me format them so we can submit them to the Common Voice project. In total there are roughly 289,348 lines across 817 files. If anyone would like to help, it would be very much appreciated.

Thanks

Hi,

Can you upload the files to github or similar so people can check the current format and suggest options?

Thanks!

@nukeador I will shortly.

Okay, so it wasn’t shortly, but I dropped them here: https://github.com/missuniverse/Romanian
@nukeador

@dabinat might be able to help you split the text into sentences so they can be submitted to the Sentence Collector.
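
In case it helps as a starting point, here is a rough sketch of the splitting step. It assumes the files are SRT-style subtitles (a cue number, a timestamp line, then the text), which may not match what is actually in the repo, so treat the patterns as guesses. The function name `extract_sentences` is made up for this example, and the sentence split is deliberately naive.

```python
import re

# Assumption: SRT-style input (cue number, timestamp line, text lines, blank line).
# The real files may use a different layout, so adjust these patterns as needed.
TIMESTAMP = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
TAG = re.compile(r"<[^>]+>|\{[^}]+\}")  # markup such as <i>...</i> or {\an8}


def extract_sentences(srt_text):
    """Return a list of sentences found in SRT-style subtitle text."""
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and timestamp lines.
        # Note: a dialogue line made only of digits would be dropped too.
        if not line or line.isdigit() or TIMESTAMP.search(line):
            continue
        line = TAG.sub("", line).strip()
        line = line.lstrip("-").strip()  # drop a leading dialogue dash
        if line:
            kept.append(line)
    # Join everything so sentences broken across cues are put back together,
    # then split naively after ., ! or ? followed by whitespace.
    text = " ".join(kept)
    return [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:03,000\nBună ziua. Ce mai\n\n"
        "2\n00:00:03,500 --> 00:00:05,000\nfaci astăzi?\n"
    )
    for sentence in extract_sentences(sample):
        print(sentence)
```

The output would still need deduplication and a manual check, and the Sentence Collector has its own validation rules, so it is worth reading those before submitting anything.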