Hungarian language

NoiseEHC · July 6, 2020, 8:50am

Hi!

I would like to contribute to the Hungarian voices, but when I try it, it seems to be only in the Sentence Collection phase. How can I contribute to the Sentence Collection? I guess translating 1000-2000 English sentences to Hungarian will take no time, and then voice collection could start.

Unfortunately I could not find any information about it on the site.

Thanks,
Andrew

nukeador · July 6, 2020, 10:33am

Hello and welcome to the community!

Please check our pinned readme and let us know if you have additional questions:

Thanks!

djlancelot · July 17, 2020, 8:31pm

Hi András,
We are past 4300 validated sentences, so fewer than 700 are still needed before we have enough to start the voice collection. Please go and validate some of the sentences by logging into the Sentence Collector portal here: https://common-voice.github.io/sentence-collector/?#/
Thanks!

nukeador · July 17, 2020, 9:22pm

A quick recommendation: please look into running the sentence extraction process for Hungarian, so that you get as many sentences as possible sooner.

The Sentence Collector is only recommended once you have already done this and gone through EuroParl or other big CC-0 sources, and you want to add more sentence diversity. Note that manual sentence collection is a slow process, and you will run out of sentences to record really soon with just the initial 5000.

Adrijaned · July 18, 2020, 9:05am

And to put into perspective what “really soon” means: after collecting the initial 5000 for Czech for about a year, the sentences were gone through within a week after launch, iirc. And all it took was a launch announcement on several Czech sites.

djlancelot · July 21, 2020, 3:44pm

I’ve added this PR for initial feedback. I’m still searching for a better way to create the blocklist, but results look promising.

mkohler · July 22, 2020, 9:39pm

Would you mind elaborating on what a better way would be? What way did you use to create it?

djlancelot · July 27, 2020, 8:42am

I added the details to the PR. It has 3 parts:

- A manual list of entries based on the frequent entries in the word usage list.
- All words were stemmed, and the occurrence counts were summed up per stem.
- All stems and their occurrences with a count of fewer than 12 were removed.
- All stems and their occurrences with a count of more than 12 and lower than 36, with fewer than 4 occurrences, were removed too.
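The stem-based filtering described above could look roughly like the following. This is only a minimal sketch under assumptions, not the code from the PR: `naive_stem` is a hypothetical stand-in for a real Hungarian stemmer, the function and parameter names are made up for illustration, and the second rule is read as "the stem total is in the 12 to 36 band and the individual word form occurs fewer than 4 times", which is one possible interpretation of the description.

```python
from collections import Counter, defaultdict


def naive_stem(word):
    """Hypothetical stand-in stemmer: strips a few common Hungarian suffixes.

    A real run would use a proper Hungarian stemmer instead.
    """
    for suffix in ("okat", "ban", "ben", "nak", "nek", "ok", "ek", "t"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_by_stem_frequency(word_counts, stem=naive_stem,
                            low=12, high=36, min_form_count=4):
    """Split word forms into (kept, removed) sets based on stem totals.

    word_counts maps each word form to its occurrence count.
    """
    stem_totals = Counter()
    forms_by_stem = defaultdict(list)
    for word, count in word_counts.items():
        s = stem(word)
        stem_totals[s] += count          # sum occurrences per stem
        forms_by_stem[s].append(word)

    kept, removed = set(), set()
    for s, total in stem_totals.items():
        for word in forms_by_stem[s]:
            if total < low:
                # Rule 1: rare stem, drop every form of it.
                removed.add(word)
            elif total < high and word_counts[word] < min_form_count:
                # Rule 2 (assumed reading): mid-frequency stem, drop rare forms.
                # A total of exactly 12 is treated as part of this band.
                removed.add(word)
            else:
                kept.add(word)
    return kept, removed
```

The manual list from the first part would then be merged with the result by hand. The thresholds are exposed as parameters so they can be tuned while reviewing the output.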

djlancelot · July 27, 2020, 8:45am

I noticed that the Sentence Collector has more than 5k sentences now. Is it possible to add those while the Wikipedia extractor PR is in progress? Based on my experience, the Sentence Collector has much higher quality data than the Wiki extractor ATM.

mkohler · July 27, 2020, 4:35pm

Yes, the PR for that is here: https://github.com/mozilla/voice-web/pull/2845. That was merged an hour ago and will be part of the next release.

djlancelot · July 27, 2020, 5:45pm

Awesome, thanks @mkohler.
