Is there a tool you could recommend for adding custom training data? Something that would do the replacements for the transcripts and the language model, and maybe some alignment checking?
I mean a tool that could produce cleaned data from audio files and their transcripts: it would replace abbreviations and numbers and, if possible, check that the audio aligns with the transcript. The tool mentioned above does something like this (maybe without the alignment check), but some problems are pretty difficult to handle in this setup.
I am just curious which tools users (or Mozilla) currently use to create datasets for new languages; maybe I missed something.
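To make the kind of cleaning I mean concrete, here is a minimal Python sketch of the replacement step only. It is an illustration, not an existing tool: the abbreviation table, the digit table, and the function names `normalize_transcript` and `needs_review` are all made up for this example, and it does not attempt any audio alignment checking.

```python
import re

# Hypothetical abbreviation table; a real one would be dataset-specific.
ABBREVIATIONS = {
    "z.B.": "zum Beispiel",
    "usw.": "und so weiter",
    "Dr.": "Doktor",
}

# Only standalone single digits are spelled out here. Larger numbers,
# dates, ordinals and grammatical agreement ("ein" vs. "eins") are
# deliberately left out of this sketch.
DIGITS = {
    "0": "null", "1": "eins", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}


def normalize_transcript(text):
    """Expand known abbreviations and spell out standalone single digits."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Match a digit that has no other digit directly before or after it.
    text = re.sub(r"(?<!\d)\d(?!\d)", lambda m: DIGITS[m.group()], text)
    # Collapse repeated whitespace left over from the source text.
    return re.sub(r"\s+", " ", text).strip()


def needs_review(text):
    """Flag transcripts that still contain digits after normalization."""
    return bool(re.search(r"\d", normalize_transcript(text)))


if __name__ == "__main__":
    print(normalize_transcript("Dr. Meier hat z.B. 3 Hunde"))
    print(needs_review("Im Jahr 1999"))
```

Anything that `needs_review` flags (multi-digit numbers, years and so on) would still have to be handled by hand or by a proper number-to-words step, which is exactly the part that gets difficult.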
lissyx
For French Common Voice, this is something we try to deal with before submitting the text data to Sentence Collector. Generally speaking, it’s really dataset-specific.
I see, it’s a lot of work in general… If someone stumbles on something that could be useful for different languages (particularly German), please let me know. I think it would be useful for many.
Thanks!
lissyx
It looks like people working on German should work together…