Material needed for FRENCH model creation

elpimous_robot · December 19, 2017, 4:09pm

Hi all, I’d like to create a French DeepSpeech model:

VoxForge provides ~140 hours and is not too difficult to clean.
LIUM (University of Le Mans) provides many hours, but its audio is in ‘sph’ (SPHERE) format and must be converted to WAV.
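
For the sph-to-WAV step, here is a minimal sketch that assumes the `sox` command-line tool is installed and can read the SPHERE files in question. The function names, directory layout, and the 16 kHz mono 16-bit target are my assumptions, not something stated in this thread. Some SPHERE files use embedded compression that sox may not decode, in which case a dedicated tool such as sph2pipe may be needed.

```python
import subprocess
from pathlib import Path


def sox_command(sph_path, wav_dir, rate=16000):
    """Build the sox argument list that converts one .sph file to WAV."""
    src = Path(sph_path)
    dst = Path(wav_dir) / (src.stem + ".wav")
    # Output options (-r rate, -c channels, -b bit depth) go before the output file.
    return ["sox", str(src), "-r", str(rate), "-c", "1", "-b", "16", str(dst)]


def convert_all(sph_dir, wav_dir):
    """Convert every .sph file under sph_dir; requires sox on the PATH."""
    Path(wav_dir).mkdir(parents=True, exist_ok=True)
    for sph in sorted(Path(sph_dir).rglob("*.sph")):
        subprocess.run(sox_command(sph, wav_dir), check=True)
```

Calling `convert_all("lium/sph", "lium/wav")` (hypothetical paths) would write one WAV per input file and stop at the first file sox cannot convert.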

If you know of other sources of audio material, please reply here.

Thanks all

lissyx · January 5, 2018, 5:44am

One thing that we could work on would be to extract audio and text from French parliament sessions. I know the debates’ transcriptions are not 100% accurate, but the written speeches from ministers, for example, should be.

elpimous_robot · January 5, 2018, 11:07am

Hi Lissyx,
Yes, it could be a good way to get more WAV files.
Now we need a tool to cut sentences properly, because we can’t rely on pause duration alone to detect the end of a sentence (a full stop).

Audiobooks (LibriVox) are a good source, too.

yv001 · January 5, 2018, 12:12pm

Not sure about French specifically but other European countries have national language institutions that are building and sharing language corpora including audio corpora.

They both collect data themselves and use sources like transcribed news from national TV. You can check if that’s available for French.

lissyx · January 5, 2018, 12:16pm

Interesting, do you have any links for that? I haven’t managed to find anything at EU level.

yv001 · January 5, 2018, 12:51pm

I was looking for a national language specifically rather than a unified EU platform, but some raw resources (different licenses, different formats) for several languages can be found here:

list of corpora

I suppose it’s better than starting from scratch, especially when used for non-commercial purposes.

elpimous_robot · January 5, 2018, 12:57pm

Yes, nice idea.

Question: does anyone have an idea for cutting audio/text?
I think we must limit sentence duration: a long sentence can cause bad synchronization.
Cutting a sentence at a full stop is the normal way, but what about VAD (voice activity detection)?

Please add your ideas (or links to programs, LOL).
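
To make the VAD question concrete, here is a rough sketch of the idea using a plain energy threshold as a stand-in for a real voice activity detector. It cuts where the signal stays quiet for a while and also forces a cut when a segment gets too long. The function names, parameter names, and default values are all hypothetical, and a real recording would need the threshold tuned.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_on_silence(samples, rate, frame_ms=30, threshold=0.01,
                     min_silence_ms=300, max_segment_s=10.0):
    """Return (start, end) sample indices of segments separated by silence."""
    frame_len = int(rate * frame_ms / 1000)
    min_silence = max(1, min_silence_ms // frame_ms)
    max_frames = int(max_segment_s * 1000 / frame_ms)
    n_frames = len(samples) // frame_len
    segments = []
    start = None      # first frame of the open segment, or None
    silent_run = 0    # consecutive silent frames inside the open segment
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced = frame_rms(frame) >= threshold
        if start is None:
            if voiced:
                start, silent_run = i, 0
            continue
        silent_run = 0 if voiced else silent_run + 1
        too_long = (i + 1 - start) >= max_frames
        if silent_run >= min_silence or too_long:
            end = i + 1 - silent_run  # drop the trailing silence
            segments.append((start * frame_len, end * frame_len))
            start, silent_run = None, 0
    if start is not None:
        # Close the segment still open at the end of the audio.
        end = n_frames - silent_run
        segments.append((start * frame_len, end * frame_len))
    return segments
```

This only finds where to cut the audio; matching each segment to its text still needs an aligner or a transcript that is already split the same way.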

lissyx · January 5, 2018, 1:11pm

I wanted (but had no time) to try and use Gentle [https://lowerquality.com/gentle/] for that on some videos from assemblee-nationale.fr

elpimous_robot · January 5, 2018, 1:14pm

Nice, Lissyx.
I’ll have a look at this open-source program.

Gman · February 9, 2018, 8:31pm

Hi @elpimous_robot,

Are you still looking to create a French model?
Is there anything we can do to help?

I’ve tried to record myself on VoxForge, but my recordings haven’t been validated… so I suppose the project is dead and, alas, Mozilla doesn’t yet offer the opportunity to record voices in languages other than English.

Regards,

G

lissyx · February 9, 2018, 8:24pm

Common Voice is being localized, as we speak. So it should be possible to work on other languages as soon as this is completed. @mikehenrty ?

Gman · February 9, 2018, 8:34pm

Thanks @lissyx, that looks like good news. Let’s hope it will be ready soon!

But if I can help in any other way, I’ll be glad to.

elpimous_robot · February 9, 2018, 9:40pm

Hi @Gman,
Thanks for your offer.
As Lissyx said, I’m waiting for the Common Voice localization…
See you soon.

mark2 · February 14, 2018, 8:34am

@lissyx, I tried Gentle on a Finnish audio file. I guess it uses English acoustic and language models, because most of the words are not recognized:

"words": [
  {
    "alignedWord": "",
    "case": "success",
    "end": 6.430000000000001,
    "endOffset": 12,
    "phones": [
      {
        "duration": 0.11,
        "phone": "oov_S"
      }
    ],
    "start": 6.32,
    "startOffset": 0,
    "word": "Oleskelulupa"
  },
  {
    "alignedWord": "",
    "case": "success",
    "end": 7.3999999999999995,
    "endOffset": 17,
    "phones": [
      {
        "duration": 0.93,
        "phone": "oov_S"
      }
    ],
    "start": 6.47,
    "startOffset": 14,
    "word": "Jos"
  },
  {
    "alignedWord": "",
    "case": "success",
    "end": 10.83,
    "endOffset": 23,
    "phones": [
      {
        "duration": 0.3,
        "phone": "oov_S"
      }
    ],
    "start": 10.53,
    "startOffset": 18,
    "word": "tulet"
  }
…

So I don’t understand how it will help you with French… If you find a workaround, tell us!
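
To put a number on "most of the words are not recognized", a small sketch like the one below could count how many words in such an output carry an out-of-vocabulary phone. It relies only on the fields visible in the excerpt above (`words`, `phones`, `phone`); the function name is hypothetical, and treating any phone label that starts with `oov` as out-of-vocabulary is my assumption from this one excerpt. The sample data is a trimmed copy of the first two entries.

```python
import json


def oov_ratio(gentle_result):
    """Fraction of words that have at least one phone labelled 'oov...'."""
    words = gentle_result.get("words", [])
    if not words:
        return 0.0
    oov = sum(
        1 for w in words
        if any(p.get("phone", "").startswith("oov") for p in w.get("phones", []))
    )
    return oov / len(words)


# Trimmed copy of the first two entries from the excerpt.
sample = json.loads("""
{"words": [
  {"word": "Oleskelulupa", "case": "success",
   "phones": [{"duration": 0.11, "phone": "oov_S"}]},
  {"word": "Jos", "case": "success",
   "phones": [{"duration": 0.93, "phone": "oov_S"}]}
]}
""")
print(oov_ratio(sample))
```

A ratio close to 1.0 on a non-English file would support the guess that the bundled models are English-only.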

lissyx · February 15, 2018, 1:18pm

Well, from my discussion with the people working on it, I think part of the plan was to make a model for French, obviously.

mark2 · February 16, 2018, 8:59am

I found a very good tool for audio and text alignment which supports several languages, including French, Finnish and many others: aeneas, https://www.readbeyond.it/aeneas/

lissyx · February 16, 2018, 9:09am

Nice, however, I guess it requires some pre-processing. They document the input to be https://raw.githubusercontent.com/readbeyond/aeneas/master/aeneas/tests/res/container/job/assets/p001.xhtml, which means someone has already segmented sentences.

I’d be interested in the results it would give on a long text. Would automatically cutting sentences on punctuation work?

My target is material like this: http://videos.assemblee-nationale.fr/video.4738358_595b8f797febf.1ere-seance--ouverture-de-la-session-extraordinaire--declaration-de-politique-generale-du-gouvern-4-juillet-2017?timecode=14408000 (that’s 4h of audio, with subtitles).

mark2 · February 16, 2018, 9:27am

Yes, you should at least segment the text (transcription) file somehow. I use the NLTK package to split raw text on punctuation, or alternatively you can split any way you want. As a result, I get each segment on its own row in the transcription file. And then aeneas does the rest!

There might be very small errors, such as the first or last word going into the wrong segment, but otherwise it performs surprisingly well.
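
The text preparation step described above could look roughly like this. It is a sketch that uses a plain regular expression instead of NLTK so that it runs without downloading any tokenizer data; the function names are hypothetical. A regex like this will wrongly split on abbreviations such as "M. Dupont", which is where a trained sentence tokenizer should do better.

```python
import re


def split_sentences(text):
    """Split raw text after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Collapse internal line breaks so each fragment fits on one row.
    return [" ".join(p.split()) for p in parts if p.strip()]


def write_fragments(text, path):
    """Write one sentence per row, ready to hand to an aligner."""
    sentences = split_sentences(text)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences) + "\n")
    return sentences


raw = "Bonjour à tous. La séance est\nouverte ! Qui demande la parole ?"
for sentence in split_sentences(raw):
    print(sentence)
```

The resulting file has one fragment per row, which is the layout described above for the transcription file.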

lissyx · February 16, 2018, 9:30am

Nice! Would you be able to give it a try on the video I linked above, if I send you links to the extracted audio and text?

mark2 · February 16, 2018, 9:31am

Sure, what I need is a single audio file and a single text file that contains the transcript of that audio.