Spoken language vs written language in Tamil

There are several issues with the current sentences provided by Common Voice for Tamil. First, they often are not complete sentences, or do not even express a coherent thought. I assume this is still acceptable. The collection also suffers from some of the issues noted in this discussion thread: Extending our sentence collection capabilities.

First, we need to be able to provide sentences from more varied sources than ancient literature. We could possibly get permission from contemporary authors to release their work under the required license (CC0) for this, for example from blog posts.

Second, Tamil has notably different spoken and written/literary variants. The written language is more formal and is understood across dialects and regions, while spoken Tamil varies considerably. I would like to know whether we can crowdsource sentences for spoken Tamil.

Please point me to documentation on how to contribute source sentences.

Also, is it possible to use existing audio transcripts of public domain works to create training datasets?

Thank you.

Hi,

In this post we list the steps to get a language ready for Common Voice.

On the sentence collection front there are a few channels: the Sentence Collector, and work to get large sources of text included in a legal way, like Wikipedia.

@nukeador

Thank you.

https://common-voice.github.io/sentence-collector/ is the type of tool I was looking for. Great.

I am wondering if there is any way to structure/process existing CC0 audio and WebVTT files and submit them to the dataset.

If you have CC-0 audio and matching WebVTT, you can directly produce a dataset for DeepSpeech.
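For illustration, the cue timings and text in a WebVTT file are already the (audio span, transcript) pairs a speech dataset needs. A minimal sketch to inspect them, assuming the third-party `webvtt-py` package (`pip install webvtt-py`) and a placeholder file name:

```python
# Minimal sketch (not official Common Voice tooling): list the cue
# timings and text from a WebVTT file. "recording.vtt" is a placeholder.
import webvtt

for caption in webvtt.read("recording.vtt"):
    # Each cue pairs a time span ("HH:MM:SS.mmm" strings) with its
    # transcript text, which is exactly what a speech dataset needs.
    print(caption.start, caption.end, caption.text)
```

Each cue can then be cut out of the audio and paired with its text.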

@lissyx Can you please point me to the tooling to generate a WebVTT-and-audio dataset and contribute it to the common dataset?

You said you already had that data. I don’t know of tooling for WebVTT, sorry.

We have a project to create audiobooks and transcribed materials. We can take the additional step of creating time-coded transcripts according to the WebVTT standard.

I would like to know whether we can contribute that to the Common Voice dataset (i.e., do we need to do any processing?).

No, I meant that if you have text and audio, you can directly use that in DeepSpeech.

I can’t help you with WebVTT, but as soon as you have that transformed to text only, you can use the Sentence Collector to contribute.

I see, thank you.

Ideally, if there were a way for us to split the WebVTT and audio into sentences and contribute them to Common Voice, that might speed up dataset development, especially for small languages. I understand that this will involve manual work.

I have a script to create WAV files from an SRT, which isn’t vastly different from WebVTT. I’ve been meaning to open-source it; it just needs some cleaning up, and I probably won’t have time for another few weeks.
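A rough sketch of what such a script might look like (not the actual script): it assumes the third-party `pysrt` and `pydub` packages (pydub needs ffmpeg installed), uses placeholder file names, and writes the `wav_filename`/`wav_filesize`/`transcript` CSV columns that DeepSpeech’s importers expect.

```python
# Hypothetical sketch: cut a WAV into per-cue clips from an SRT file
# and write a DeepSpeech-style CSV. File names are placeholders.
import csv
import os

import pysrt
from pydub import AudioSegment

def split_by_srt(wav_path, srt_path, out_dir):
    audio = AudioSegment.from_wav(wav_path)
    subs = pysrt.open(srt_path)
    os.makedirs(out_dir, exist_ok=True)
    rows = []
    for i, cue in enumerate(subs):
        # SubRipTime.ordinal is the cue time in milliseconds,
        # matching pydub's millisecond-based slicing.
        clip = audio[cue.start.ordinal:cue.end.ordinal]
        clip_path = os.path.join(out_dir, f"clip_{i:04d}.wav")
        clip.export(clip_path, format="wav")
        transcript = " ".join(cue.text.split())  # collapse line breaks within a cue
        rows.append([clip_path, os.path.getsize(clip_path), transcript])
    with open(os.path.join(out_dir, "clips.csv"), "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        writer.writerows(rows)

split_by_srt("talk.wav", "talk.srt", "clips")
```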

But here’s a problem you’ll probably run into: the subtitle timings may not line up exactly with the audio. You’d need forced alignment to solve that: https://github.com/mozilla/DSAlign