Tool for creating custom training dataset for different languages

Is there any tool you could recommend for adding custom training data? A tool that would handle the text replacements for transcripts and the language model, and maybe do some alignment checking?

I was trying to work with https://github.com/silenterus/deepspeech-cleaner but it is so difficult to debug that I need to switch to something else.

We don't know about that tool.

I’m not sure I understand exactly what you want here, can you elaborate?

A tool that would produce cleaned data from audio files fed in together with their transcripts. It would replace abbreviations and numbers, and maybe check the audio's alignment with the transcript if possible. The tool I mentioned does something like this (maybe without alignment), but there are some problems that are pretty difficult to solve in this setup.
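Just to make concrete what I mean by "replacing": a minimal sketch of the text-normalization part, assuming German and a hand-written replacement table (the abbreviation list and single-digit mapping here are illustrative placeholders, not from any existing tool; a real cleaner would need full number verbalization and a proper rule file):

```python
import re

# Hypothetical replacement table -- a real cleaner would load
# language-specific rules from a config file.
ABBREVIATIONS = {
    "z.B.": "zum Beispiel",
    "Dr.": "Doktor",
    "usw.": "und so weiter",
}

# Minimal digit-to-word mapping for illustration only; a real tool
# would need full number verbalization, not just single digits.
DIGITS = {"0": "null", "1": "eins", "2": "zwei", "3": "drei",
          "4": "vier", "5": "fünf", "6": "sechs", "7": "sieben",
          "8": "acht", "9": "neun"}

def clean_transcript(text: str) -> str:
    """Expand abbreviations, spell out single digits, and restrict
    the result to a DeepSpeech-style lowercase alphabet."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # Replace standalone single digits with their spoken form.
    text = re.sub(r"\b(\d)\b", lambda m: DIGITS[m.group(1)], text)
    # Strip characters outside the model's alphabet.
    text = re.sub(r"[^a-zäöüß ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("Dr. Müller kam um 8 Uhr, z.B. montags."))
# → doktor müller kam um acht uhr zum beispiel montags
```

That covers the easy cases; the hard part is everything this sketch ignores (multi-digit numbers, ambiguous abbreviations, alignment checking).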

I am just curious what tools are currently used by users (or Mozilla) for creating new-language datasets; maybe I missed something.

For French Common Voice, this is something we try to deal with before submitting the text data to Sentence Collector. Generally speaking, it’s really dataset-specific.

I see, it’s a lot of work in general… If someone stumbles on something that could be useful for other languages (particularly German), please let me know. I think it would be useful for many people.

Thanks!

It looks like people working on German should work together…

You may be able to modify the wiki scraper code for your use case:

It has cleanup rules for various languages that you could use for cleaning your dataset.
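The idea of per-language cleanup rules can be sketched roughly like this (the rule format and contents below are purely illustrative; the scraper's actual rules live in its own per-language configuration files):

```python
import re

# Illustrative per-language rule sets as (pattern, replacement) pairs.
# These examples are made up for demonstration purposes.
CLEANUP_RULES = {
    "de": [
        (r"\bz\.B\.", "zum Beispiel"),
        (r"\d+", ""),        # drop digits (a real cleaner might verbalize them)
        (r"[„“»«]", '"'),    # normalize German quotation marks
    ],
    "fr": [
        (r"\bM\.\s", "Monsieur "),
        (r"[«»]", '"'),
    ],
}

def apply_rules(text: str, lang: str) -> str:
    """Apply a language's cleanup rules in order, then tidy whitespace."""
    for pattern, replacement in CLEANUP_RULES.get(lang, []):
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()

print(apply_rules("Er sagte „z.B. 42 Mal“.", "de"))
```

Keeping the rules as plain data per language (rather than hard-coded logic) is what makes this kind of tool reusable across languages.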
