Hello,
I’m supporting a group in Kyrgyzstan that has collected a dataset of 10,000 sentences with the associated audio. There is no copyright on the data. Can we get it added to the existing Kyrgyz dataset on Common Voice?
Any help would be greatly appreciated.
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
The Common Voice dataset consists only of audio recorded through our site. If you have a full dataset (text and audio) for your language, it would be better to publish it somewhere, with details about the methodology and QA controls you used to collect it.
This dataset might be usable for #deep-speech model training, together with the Common Voice dataset.