I would like to help improve the Slovenian language model.
Currently I am negotiating to obtain speech data from Slovenian TV stations.
The speech data would be journalists’ voiceovers and text transcripts of the stories.
The majority of the recordings are made in a studio environment (clean, no background noise), and ideally I will have access to the 3 “main” accents.
The problem is that I will certainly not be able to release the data under a CC (or any other free) license. I also don’t currently have access to a decent hardware setup, so training would take forever.
The data would also need to be preprocessed before training, since each recording would contain a whole story (multiple sentences). I saw on your blog that you also gathered data from TV and radio stations, so I imagine you have already had to do something similar.
I would like to know whether some kind of agreement could be made with Mozilla for such a data set, or whether this would not be possible.
We have previously licensed data from broadcasters, but we’d have to license the data directly from the broadcaster. We couldn’t do so through you, @JakaBac.
However, the big problem I see in this case is that the data is not aligned. To align the data you need a rudimentary STT engine for Slovenian. We don’t have such an engine, but given enough aligned data we could create one.
I didn’t mean that the licensing would go through me. I just wanted to know whether Mozilla is open to doing something like this.
Could you please let me know what would be needed from your side for the licensing procedure, so I can talk to the relevant people here? We can also take this part of the conversation offline. Just let me know how to proceed.
Today I got some sample data from one of the broadcasters. I will post more details later.
For the rudimentary STT engine, would it be enough for the data to consist of whole stories paired with their text, or would it be necessary to split the audio into distinct sentences? Also, please let me know how much is “enough aligned data”.
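To make the splitting question more concrete, here is a minimal sketch of one way whole-story recordings could be cut into shorter chunks at long pauses. It is only an illustration, not anything Mozilla has described: the function name `split_on_silence`, the energy threshold, and the frame and pause lengths are all assumptions, and it expects samples already decoded to floats in the range -1.0 to 1.0. It only finds candidate audio segments; matching each segment to the right sentence of the transcript would still need alignment.

```python
import math


def frame_rms(frame):
    """Root-mean-square loudness of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_on_silence(samples, sample_rate, frame_ms=20,
                     threshold=0.01, min_silence_ms=300):
    """Return (start_sec, end_sec) speech segments separated by long pauses.

    All default values are illustrative assumptions and would need tuning
    on real studio recordings.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    n_frames = math.ceil(len(samples) / frame_len)

    segments = []
    start = None        # first frame of the segment currently open
    last_speech = None  # last loud frame seen in the open segment
    silent_run = 0      # consecutive quiet frames since last_speech

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Pause is long enough: close the segment at the last loud frame.
                segments.append((start * frame_len / sample_rate,
                                 (last_speech + 1) * frame_len / sample_rate))
                start = None
                silent_run = 0

    if start is not None:
        # Recording ended while a segment was still open.
        end_sample = min((last_speech + 1) * frame_len, len(samples))
        segments.append((start * frame_len / sample_rate,
                         end_sample / sample_rate))
    return segments


if __name__ == "__main__":
    # Synthetic signal at 1 kHz: 0.2 s "speech", 0.4 s silence, 0.2 s "speech".
    demo = [0.5] * 200 + [0.0] * 400 + [0.5] * 200
    print(split_on_silence(demo, sample_rate=1000))
```

Pauses shorter than `min_silence_ms` are kept inside a segment, so a brief breath in the middle of a sentence would not cause a cut, while a pause between sentences usually would. Whether pauses in voiceovers actually line up with sentence boundaries is something that would have to be checked on the sample data.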