Hungarian language

Hi!

I would like to contribute to the Hungarian voices, but when I try it, it seems to be only in the Sentence Collection phase. How can I contribute to the Sentence Collection? I guess translating 1000-2000 English sentences to Hungarian will take no time, and then voice collection could start.

Unfortunately I could not find any information about it on the site.

Thanks,
Andrew

Hello and welcome to the community!

Please check our pinned readme and let us know if you have additional questions:

Thanks!

Hi András,
We are past 4300 validated sentences, so fewer than 700 are still needed before we have enough to start the voice collection. Please go and validate some of the sentences by logging into the Sentence Collector portal here: https://common-voice.github.io/sentence-collector/?#/
Thanks!

A quick recommendation: please look into running the sentence extraction process for Hungarian, so that you get as many sentences as possible sooner.

The Sentence Collector is only recommended once you have already done this, along with EuroParl or other big CC-0 sources, and you want to add more sentence diversity. Note that manual sentence collection is a slow process, and you will run out of sentences to record really soon with just the initial 5000.

And to put into perspective what “really soon” means: after collecting the initial 5000 for Czech for about a year, the sentences were used up within a week of launch, IIRC. And all it took was a launch announcement on several Czech sites.

I’ve added this PR for initial feedback. I’m still searching for a better way to create the blocklist, but results look promising.

Would you mind elaborating on what a better way would be? What way did you use to create it?

I added the details to the PR. It has 3 parts:

  • a manual list of entries based on the frequent entries in the word usage list
  • all words were stemmed, and the occurrence counts were summed per stem.
    • All stems and their occurrences with a count of fewer than 12 were removed.
    • All stems and their occurrences with a count of more than 12 and lower than 36, with fewer than 4 occurrences, were removed too.
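
To make the steps above easier to discuss, here is a minimal Python sketch of one possible reading of them. It is not the code from the PR. The names `toy_stem` and `removed_words` are hypothetical, the suffix list is only a stand-in for a real Hungarian stemmer, and I am assuming that "fewer than 4 occurrences" refers to the individual word form while 12 and 36 refer to the summed count of its stem.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real Hungarian stemmer (strips a few suffixes)."""
    for suffix in ("ban", "ben", "nak", "nek", "ok", "ek"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def removed_words(word_counts, manual_entries=(), stem=toy_stem,
                  low=12, high=36, min_word_count=4):
    """Return the word forms that the rules above would remove.

    word_counts maps each word form to its occurrence count.
    The thresholds are the ones quoted in the post; how they combine
    is an assumption, not something confirmed by the PR.
    """
    # Sum the occurrence counts of all word forms that share a stem.
    stem_totals = defaultdict(int)
    for word, count in word_counts.items():
        stem_totals[stem(word)] += count

    # Part 1: the manually curated entries.
    removed = set(manual_entries)

    for word, count in word_counts.items():
        total = stem_totals[stem(word)]
        if total < low:
            # Stems with a summed count of fewer than 12.
            removed.add(word)
        elif low < total < high and count < min_word_count:
            # Stems between 12 and 36 whose word form is itself rare.
            removed.add(word)
    return removed


example_counts = {
    "ház": 10,       # stem "ház", summed count 13
    "házban": 3,     # same stem, rare word form
    "kutya": 5,      # stem total below 12
    "asztal": 40,    # stem "asztal", summed count 42
    "asztalok": 2,   # rare form, but the stem is frequent
}
example_removed = removed_words(example_counts, manual_entries=["the"])
```

With these made-up counts, "kutya" falls under the first threshold, "házban" under the second, and "asztalok" is kept because its stem is frequent even though the form itself is rare.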

I noticed that the Sentence Collector has more than 5k sentences now. Is it possible to add those while the Wikipedia extractor PR is in progress? In my experience, the Sentence Collector has much higher quality data than the Wiki extractor at the moment.

Yes, the PR for that is here: https://github.com/mozilla/voice-web/pull/2845. That was merged an hour ago and will be part of the next release.

Awesome, thanks @mkohler.