Material needed for FRENCH model creation

Gman · February 9, 2018, 8:34pm

Thanks @lissyx, that looks like good news. Let’s hope it will be ready soon!

If I can help in any other way, I’ll be glad to.

elpimous_robot · February 9, 2018, 9:40pm

hi @Gman,
Thanks for your offer.
As Lissyx said, I’m waiting for the Common Voice localization…
See you pretty soon.

mark2 · February 14, 2018, 8:34am

@lissyx, I tried Gentle on a Finnish audio file. I guess it uses English acoustic and language models, because most of the words are not recognized:

"words": [
  {
    "alignedWord": "",
    "case": "success",
    "end": 6.430000000000001,
    "endOffset": 12,
    "phones": [
      {
        "duration": 0.11,
        "phone": "oov_S"
      }
    ],
    "start": 6.32,
    "startOffset": 0,
    "word": "Oleskelulupa"
  },
  {
    "alignedWord": "",
    "case": "success",
    "end": 7.3999999999999995,
    "endOffset": 17,
    "phones": [
      {
        "duration": 0.93,
        "phone": "oov_S"
      }
    ],
    "start": 6.47,
    "startOffset": 14,
    "word": "Jos"
  },
  {
    "alignedWord": "",
    "case": "success",
    "end": 10.83,
    "endOffset": 23,
    "phones": [
      {
        "duration": 0.3,
        "phone": "oov_S"
      }
    ],
    "start": 10.53,
    "startOffset": 18,
    "word": "tulet"
  }
…

So I don’t understand how it will help you with French… If you find a workaround, let us know!
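
To put a number on "most of the words are not recognized", here is a minimal Python sketch that counts how many entries in a Gentle-style result carry an out-of-vocabulary phone. It assumes the JSON has the shape quoted above (a `words` list, each entry with a `phones` list whose `phone` value starts with `oov`). The helper name `oov_ratio` and the trimmed sample are illustrative only and are not part of Gentle.

```python
import json

# Trimmed sample in the shape of the output quoted above (assumed structure).
SAMPLE = """
{"words": [
  {"alignedWord": "", "case": "success",
   "phones": [{"duration": 0.11, "phone": "oov_S"}], "word": "Oleskelulupa"},
  {"alignedWord": "", "case": "success",
   "phones": [{"duration": 0.93, "phone": "oov_S"}], "word": "Jos"},
  {"alignedWord": "", "case": "success",
   "phones": [{"duration": 0.3, "phone": "oov_S"}], "word": "tulet"}
]}
"""


def is_oov(word):
    """True if any phone of this word entry is marked out-of-vocabulary."""
    return any(p.get("phone", "").startswith("oov") for p in word.get("phones", []))


def oov_ratio(alignment):
    """Fraction of word entries that contain an out-of-vocabulary phone."""
    words = alignment.get("words", [])
    if not words:
        return 0.0
    return sum(1 for w in words if is_oov(w)) / len(words)


if __name__ == "__main__":
    print(oov_ratio(json.loads(SAMPLE)))
```

A ratio close to 1.0 on a whole file would support the guess that the models do not match the language of the audio.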

lissyx · February 15, 2018, 1:18pm

Well, from my discussions with the people working on that, I think part of it was about making a model for French, obviously.

mark2 · February 16, 2018, 8:59am

I found a very good tool for audio and text alignment that supports several languages, including French, Finnish and many others: Aeneas, https://www.readbeyond.it/aeneas/

lissyx · February 16, 2018, 9:09am

Nice, however, I guess it requires some pre-processing. They document the input to be https://raw.githubusercontent.com/readbeyond/aeneas/master/aeneas/tests/res/container/job/assets/p001.xhtml, which means someone has already segmented sentences.

I’d be interested in the results it would give on a long text. Would automatically cutting sentences on punctuation work?

My target is stuff like that: http://videos.assemblee-nationale.fr/video.4738358_595b8f797febf.1ere-seance--ouverture-de-la-session-extraordinaire--declaration-de-politique-generale-du-gouvern-4-juillet-2017?timecode=14408000 that’s 4h of audio, with subtitles.

mark2 · February 16, 2018, 9:27am

Yes, you should at least segment the text (transcription) file somehow. I use the NLTK package to split the raw text on punctuation, but you can split it any way you want. As a result I get each segment on its own line in the transcription file. And then Aeneas does the rest!

There might be a very small error, such as the first or last word ending up in the wrong segment, but otherwise it performs surprisingly well.
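
The "one segment per line" step can be sketched roughly as below. This is not the NLTK-based script described above: to keep it dependency-free it swaps in a naive standard-library regex, which will wrongly split after abbreviations such as "M." where a real sentence tokenizer would not. The function names are made up for the example.

```python
import re

# Split on whitespace that follows ., !, ? or … (naive: no abbreviation handling).
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text):
    """Return the sentences of `text`, split on sentence-ending punctuation."""
    # Collapse newlines and repeated spaces so each sentence fits on one line.
    normalized = " ".join(text.split())
    return [s for s in _SENTENCE_END.split(normalized) if s]


def to_plain_transcript(text):
    """Join the sentences with newlines: one segment per line."""
    return "\n".join(split_sentences(text))


if __name__ == "__main__":
    raw = "Bonjour à tous. Comment allez-vous ?\nTrès bien !"
    print(to_plain_transcript(raw))
```

If I read the Aeneas documentation correctly, a file with one fragment per line is what it calls the "plain" text type, so the output of `to_plain_transcript` should be usable as the text input; please check that against the docs before relying on it.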

lissyx · February 16, 2018, 9:30am

Nice! Would you be able to give it a try on the video I linked above, if I extract the audio and text and send you the links?

mark2 · February 16, 2018, 9:31am

Sure, what I need is a single audio file and a single text file that contains the transcript of that audio.

elpimous_robot · February 16, 2018, 10:18am

Hi Mark2.
Good news…
Looking forward to hearing from you.

lissyx · February 20, 2018, 5:45pm

FYI, Common Voice has been added to Pontoon for localization. The French locale has been completed (except the legal stuff, which I will do at some point later) and is currently under review. So soonish we should be able to start building a text corpus and collecting voices for French :).

I don’t know yet how we should organize ourselves, but I guess Common Voice’s section of Discourse: https://discourse.mozilla.org/c/voice is a good start?

elpimous_robot · February 20, 2018, 6:18pm

What good news!
Can’t wait to test it.

Gman · February 21, 2018, 10:19pm

Yes, the link to Pontoon is https://pontoon.mozilla.org/fr/common-voice/messages.ftl
if you want to help with the translation.

lissyx · February 22, 2018, 7:55am

That’s the first step. Then we’ll have to coordinate among ourselves and build a good text corpus, but I already have some ideas on that.

Gman · February 22, 2018, 8:48am

Hi @lissyx,

I found some CC-0 books on Framasoft and have already split them into files of 500 sentences each, but I’m currently doing a sanity check on them.
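
For anyone who wants to do the same with other texts, a minimal sketch of the chunking step in Python. It assumes the sentences are already available as a list of strings; the helper names, file prefix and output layout are invented for the example and are not the script actually used here.

```python
from pathlib import Path


def chunk(sentences, size=500):
    """Split a list of sentences into consecutive groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def write_chunks(sentences, out_dir, prefix="part", size=500):
    """Write each group to its own UTF-8 file, one sentence per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, group in enumerate(chunk(sentences, size), start=1):
        path = out / f"{prefix}_{n:04d}.txt"
        path.write_text("\n".join(group) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo = [f"Phrase {i}." for i in range(1, 1202)]
    for p in write_chunks(demo, "chunks_demo"):
        print(p)
```

The last file simply holds whatever is left over, so it can be shorter than 500 sentences.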

lissyx · February 22, 2018, 10:04am

That’s nice! What kind of books?

elpimous_robot · February 22, 2018, 11:02am

I’m interested too!

Gman · February 22, 2018, 8:00pm

Hello,

Three novels, which you can find here: https://framabook.org/category/romans/ (they are "le cycle des Neonautes"),
and one more technical book about CC licenses: https://framabook.org/un-monde-sans-copyright-et-sans-monopole-2/

lissyx · February 22, 2018, 8:03pm

Nice! Something I’m a bit puzzled about is the compatibility of the AN license with CC-0: http://data.assemblee-nationale.fr/licence-ouverte-open-licence Could anyone confident in licensing have a look?

Gman · February 23, 2018, 12:04am

Hi @lissyx,

I think this sentence makes the license incompatible with CC-0:
"Acknowledge the authorship of the 'Information': its source (at minimum the name of the 'Producer')
and the date of its last update."

I’m not a legal expert, but I believe you will not be authorized to use it as part of the project, because that information will not appear under each sentence…

Under CC-0 you need to waive all copyright.