Material needed for FRENCH model creation

Nice, Lissyx.
I’ll have a look at this open-source program.

Hi @elpimous_robot elpimous,

Are you still looking to create a French model?
Is there anything we can do to help?

I’ve tried to record myself on VoxForge, but my recordings haven’t been validated, so I suppose the project is dead. Alas, Mozilla doesn’t yet offer the opportunity to record voice in languages other than English.



Common Voice is being localized, as we speak. So it should be possible to work on other languages as soon as this is completed. @mikehenrty ?

Thanks @lissyx, that looks like good news. Let’s hope it will be ready soon!

But if I can help in any other way, I’ll be glad to.

Hi @Gman,
Thanks for your offer.
As Lissyx said, I’m waiting for the Common Voice localization…
See you pretty soon.

@lissyx, I tried Gentle on a Finnish audio file. I guess it uses English acoustic and language models, because most of the words are not recognized:

    "words": [
      {
        "alignedWord": "",
        "case": "success",
        "end": 6.430000000000001,
        "endOffset": 12,
        "phones": [
          {
            "duration": 0.11,
            "phone": "oov_S"
          }
        ],
        "start": 6.32,
        "startOffset": 0,
        "word": "Oleskelulupa"
      },
      {
        "alignedWord": "",
        "case": "success",
        "end": 7.3999999999999995,
        "endOffset": 17,
        "phones": [
          {
            "duration": 0.93,
            "phone": "oov_S"
          }
        ],
        "start": 6.47,
        "startOffset": 14,
        "word": "Jos"
      },
      {
        "alignedWord": "",
        "case": "success",
        "end": 10.83,
        "endOffset": 23,
        "phones": [
          {
            "duration": 0.3,
            "phone": "oov_S"
          }
        ],
        "start": 10.53,
        "startOffset": 18,
        "word": "tulet"
      }
    ]

So, I don’t understand how it will help you with French… If you find a workaround, tell us!
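For anyone who wants to quantify "most of the words are not recognized", here is a minimal sketch that counts how many words in a Gentle-style result consist only of out-of-vocabulary phones. It assumes the result is a JSON object with a `words` list shaped like the excerpt above, and that OOV phones are labelled with an `oov` prefix as in that excerpt. The function name `oov_ratio` and the `sample` data are made up for illustration.

```python
import json


def oov_ratio(gentle_result):
    """Return the fraction of words whose phones are all out-of-vocabulary.

    Assumes each word has a "phones" list whose entries carry a "phone"
    label, and that OOV labels start with "oov" (as in the excerpt above).
    """
    words = gentle_result.get("words", [])
    if not words:
        return 0.0
    oov_count = 0
    for word in words:
        phones = word.get("phones", [])
        if phones and all(p.get("phone", "").startswith("oov") for p in phones):
            oov_count += 1
    return oov_count / len(words)


# Illustrative sample built from the three words quoted in this thread.
sample = json.loads("""
{
  "words": [
    {"word": "Oleskelulupa", "phones": [{"duration": 0.11, "phone": "oov_S"}]},
    {"word": "Jos", "phones": [{"duration": 0.93, "phone": "oov_S"}]},
    {"word": "tulet", "phones": [{"duration": 0.3, "phone": "oov_S"}]}
  ]
}
""")

print(f"OOV ratio: {oov_ratio(sample):.0%}")
```

A ratio close to 100% on a non-English file would be consistent with the guess that the aligner falls back to English models.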

Well, from my discussion with the people working on that, I think there was a part about making a model for French, obviously :slight_smile:

I found a very good tool for audio and text alignment which supports several languages, including French, Finnish and many others: Aeneas.

Nice, however, I guess it requires some pre-processing. They document the input to be, which means someone has already segmented sentences.

I’d be interested in the results it would give on a long text. Would automatically splitting sentences on punctuation work?

My target is stuff like that: that’s 4h of audio, with subtitles.

Yes, you should at least segment the text (transcription) file somehow. I use the NLTK package to split raw text on punctuation, or alternatively you can split it any way you want. As a result, I get each segment on its own row in the transcription file. And then Aeneas does the rest!

There might be a very tiny error, such as the first or last word going into the wrong segment, but otherwise it performs surprisingly well.
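To make the segmentation step concrete, here is a rough sketch of splitting raw text on sentence-ending punctuation and writing one segment per row. It deliberately uses a plain regular expression from the standard library as a stand-in, not NLTK, so it runs without downloading tokenizer data; NLTK handles abbreviations and other edge cases better. The function names are hypothetical.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
# This is a naive rule: abbreviations such as "M. Dupont" will be split too.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(raw_text):
    """Return a list of sentence-like segments from raw text."""
    # Collapse newlines and repeated spaces so each segment is a single line.
    text = " ".join(raw_text.split())
    return [segment for segment in _SENTENCE_END.split(text) if segment]


def to_one_segment_per_row(raw_text):
    """Return the transcription with each segment on its own row."""
    return "\n".join(split_sentences(raw_text)) + "\n"


if __name__ == "__main__":
    demo = "Bonjour tout le monde. Comment allez-vous?\nTrès bien, merci!"
    print(to_one_segment_per_row(demo), end="")
```

The resulting text, saved to a file, is the kind of one-segment-per-row transcription described above.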

Nice! Would you be able to give it a try on the video I linked above, if I extract the audio and text and send you the links?

Sure, what I need is a single audio file and a single text file that contains the transcript of that audio.
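For reference, with those two files in hand, an Aeneas run could be launched roughly as sketched below. This only builds the command line for the `aeneas.tools.execute_task` module with a plain-text transcript, French (`fra`) as the task language, and JSON as the sync map format. The file names and the helper `aeneas_command` are placeholders, and actually running the command assumes Aeneas and its dependencies are installed.

```python
import sys


def aeneas_command(audio_path, text_path, out_path, language="fra"):
    """Build the argument list for an Aeneas forced-alignment task.

    The transcript is expected to hold one segment per row ("plain" text type).
    """
    config = "|".join([
        f"task_language={language}",   # language code, e.g. "fra" for French
        "is_text_type=plain",          # one text fragment per line
        "os_task_file_format=json",    # write the sync map as JSON
    ])
    return [
        sys.executable, "-m", "aeneas.tools.execute_task",
        audio_path, text_path, config, out_path,
    ]


if __name__ == "__main__":
    # Placeholder file names; replace with the real audio and transcript.
    cmd = aeneas_command("audio.mp3", "transcript.txt", "syncmap.json")
    print(" ".join(cmd))
    # To run it for real (requires Aeneas to be installed):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```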

Hi Mark2.
Good news…
Looking forward to hearing from you.

FYI, Common Voice has been added on Pontoon for localization. The French locale has been completed (except the legal stuff, which I will do at some point later) and is currently under review. So soonish we should be able to start building a text corpus and collecting voices for French :).

I don’t know yet how we should organize ourselves, but I guess Common Voice’s section of Discourse: is a good start?

What good news!!
Can’t wait to test.

Yes, the link to Pontoon is
if you want to participate in the translation

That’s the first step. Then we’ll have to coordinate ourselves and build a good text corpus, but I already have my ideas on that :slight_smile:

Hi @lissyx,

I found CC0 books on Framasoft and have already split them into files of 500 sentences each, but I’m currently sanity-checking them.
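In case it helps others prepare text the same way, here is a small sketch of cutting a list of sentences into files of 500 sentences each. It is not necessarily how the split above was done; the function names, the `book` file prefix and the output layout are assumptions for illustration.

```python
from pathlib import Path


def chunk(sentences, size=500):
    """Split a list of sentences into consecutive lists of at most `size`."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def write_chunks(sentences, out_dir, prefix="book", size=500):
    """Write each chunk to its own UTF-8 file, one sentence per line.

    Returns the list of written paths (prefix_0001.txt, prefix_0002.txt, ...).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, part in enumerate(chunk(sentences, size), start=1):
        path = out / f"{prefix}_{number:04d}.txt"
        path.write_text("\n".join(part) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo = [f"Phrase {n}." for n in range(1, 1202)]
    print([len(part) for part in chunk(demo)])
```

The last file simply holds whatever is left over, so it can contain fewer than 500 sentences.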

That’s nice! What kind of books?

I am interested too!!