I would like to help improve the Slovenian language model.
Currently I am negotiating to obtain speech data from Slovenian TV stations.
The speech data would consist of journalists' voiceovers and text transcripts of the stories.
The majority of the recordings are made in a studio environment (clean, no background noise), and ideally I will have access to 3 “main” accents.
The problem is that I will certainly not be able to release the data under a CC license (or any other free license). I also don’t currently have access to a decent hardware setup, so training would take forever.
The data would also need to be preprocessed before training, since each recording would contain a whole story (multiple sentences). I saw on your blog that you also gathered data from TV and radio stations, so I imagine you have already had to do something similar.
I would like to know whether some kind of agreement could be made with Mozilla for such a dataset, or whether this would not be possible.