Tool for creating custom training dataset for different languages

Is there any tool you could recommend for adding custom training data? A tool that would handle the text replacements for transcripts and the language model, and maybe do some alignment checking?

I was trying to work with https://github.com/silenterus/deepspeech-cleaner but it is so difficult to debug that I need to switch to something else.

We don't know about that tool.

I’m not sure I understand exactly what you want here, can you elaborate?

A tool that would produce cleaned data from audio files fed in together with their transcripts. It would replace abbreviations and numbers, and maybe check the audio's alignment with the transcript if possible. The tool I mentioned does something like this (maybe without alignment), but there are some problems that are pretty difficult to solve in this setup.
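Just to make concrete what I mean by "replacing": a minimal sketch of the text-normalization part, assuming German and a hand-written replacement table (the abbreviation list and single-digit mapping here are illustrative placeholders, not from any existing tool; a real cleaner would need full number verbalization and a proper rule file):

```python
import re

# Hypothetical replacement table -- a real cleaner would load
# language-specific rules from a config file.
ABBREVIATIONS = {
    "z.B.": "zum Beispiel",
    "Dr.": "Doktor",
    "usw.": "und so weiter",
}

# Minimal digit-to-word mapping for illustration only; a real tool
# would need full number verbalization, not just single digits.
DIGITS = {"0": "null", "1": "eins", "2": "zwei", "3": "drei",
          "4": "vier", "5": "fünf", "6": "sechs", "7": "sieben",
          "8": "acht", "9": "neun"}

def clean_transcript(text: str) -> str:
    """Expand abbreviations, spell out single digits, and restrict
    the result to a DeepSpeech-style lowercase alphabet."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # Replace standalone single digits with their spoken form.
    text = re.sub(r"\b(\d)\b", lambda m: DIGITS[m.group(1)], text)
    # Strip characters outside the model's alphabet.
    text = re.sub(r"[^a-zäöüß ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("Dr. Müller kam um 8 Uhr, z.B. montags."))
# → doktor müller kam um acht uhr zum beispiel montags
```

That covers the easy cases; the hard part is everything this sketch ignores (multi-digit numbers, ambiguous abbreviations, alignment checking).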

I am just curious what tools are currently used by users (or Mozilla) for creating new-language datasets; maybe I missed something.

For French Common Voice, this is something we try to deal with before submitting the text data to Sentence Collector. Generally speaking, it’s really dataset-specific.

I see, it’s a lot of work in general… If someone stumbles on something that could be useful for other languages (particularly German), please let me know. I think it would be useful for many people.

Thanks!

It looks like people working on German should work together…

You may be able to modify the wiki scraper code for your use case:

It has cleanup rules for various languages that you could use for cleaning your dataset.
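The idea of per-language cleanup rules can be sketched roughly like this (the rule format and contents below are purely illustrative; the scraper's actual rules live in its own per-language configuration files):

```python
import re

# Illustrative per-language rule sets as (pattern, replacement) pairs.
# These examples are made up for demonstration purposes.
CLEANUP_RULES = {
    "de": [
        (r"\bz\.B\.", "zum Beispiel"),
        (r"\d+", ""),        # drop digits (a real cleaner might verbalize them)
        (r"[„“»«]", '"'),    # normalize German quotation marks
    ],
    "fr": [
        (r"\bM\.\s", "Monsieur "),
        (r"[«»]", '"'),
    ],
}

def apply_rules(text: str, lang: str) -> str:
    """Apply a language's cleanup rules in order, then tidy whitespace."""
    for pattern, replacement in CLEANUP_RULES.get(lang, []):
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()

print(apply_rules("Er sagte „z.B. 42 Mal“.", "de"))
```

Keeping the rules as plain data per language (rather than hard-coded logic) is what makes this kind of tool reusable across languages.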
