I created a pull request for the German corpus:
Anyone who wants to help with the review process is welcome to do so.
This corpus has around 379k sentences after cleanup. Am I moving too quickly here? Would you prefer another process?
Yes, let’s have a separate process here; this is too complex to handle in a PR alone.
I’ll reach out directly to explore which options we have for this corpus.
Sounds good. I personally reviewed a thousand sentences out of the 60K I added for Dutch (so more than 1.5% of them), and didn’t find a single sentence that would be bad to have in the corpus. The two worst were a sentence with a missing space (“overPakistan” instead of “over Pakistan”) and another that said, without context, “nuclear plants are bombs waiting to explode”. Honestly, I don’t think anybody would be truly upset if either sentence ended up in the dataset. So I validated all the sentences myself and was planning on letting a second review happen, but if there’s a different process, I’m fine with that as well. I can also provide longer sentences from that dataset, similar to what has been done for German, if the German sentences are deemed OK (they should all be translations of each other in the end).
@FremyCompany I’m working with @stergro for the German ones, to avoid overloading the sentence collector.
I think we should take a unified approach for this corpus and apply it to all languages. How many sentences were available for Dutch?
The same amount as in German, I’d say. There are only 60K in the sentence collector because I applied a strict set of rules, while the German sentences were selected more generously from the dataset. If the German sentences are deemed OK, I can apply the same filtering rules and get approximately the same number of sentences, since those sentences are translations of each other.
For German we are talking about almost 500K sentences. I would prefer that we do something similar for Dutch outside the sentence collector.
My advice for Dutch would be:
Cheers.
Hey @nukeador, what are the next steps now? We talked about a possible process with Excel review sheets, and I think it sounds like quite some work but doable. Reviewing a few thousand sentences for statistically sound results is still better than reviewing 500k sentences. Is it okay if I just prepare such a sheet for the German corpus and we see how it goes? Maybe you can explain the process again to the group so that everyone knows what we are talking about.
Yes, let’s kick off the process we discussed in private and see how it goes. Once that’s done we can share back with everyone and see how to do it with other languages.
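As a rough illustration of the sampling idea discussed here, a uniformly random review sample could be drawn along these lines. This is only a minimal sketch, not the script actually used for the German corpus: the function names `sample_for_review` and `to_review_csv`, the fixed seed, and the CSV column names are all assumptions made for illustration.

```python
import csv
import io
import random


def sample_for_review(sentences, sample_size, seed=0):
    """Return a uniformly random subset of sentences, without replacement.

    The fixed seed (an assumption for this sketch) makes the draw reproducible,
    so the same review sheet can be regenerated later.
    """
    rng = random.Random(seed)
    # Never ask for more sentences than the corpus contains.
    k = min(sample_size, len(sentences))
    return rng.sample(sentences, k)


def to_review_csv(sample):
    """Render the sample as CSV text with empty columns for reviewers to fill in."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Hypothetical column names; the real review sheet may look different.
    writer.writerow(["sentence", "ok", "comment"])
    for sentence in sample:
        writer.writerow([sentence, "", ""])
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in corpus; in practice this would be the cleaned sentence file.
    corpus = [f"Beispielsatz Nummer {i}." for i in range(1000)]
    print(to_review_csv(sample_for_review(corpus, 10)))
```

Drawing the sample uniformly at random is what allows the error rate found in the sheet to be read as an estimate for the whole corpus.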
Thanks!
Alright, I will start reviewing the sentences. Anyone who wants to help can find the link to the sheet here:
Thanks to the great help of @benekuehn and other helpers from the German forum, the 4000 sentences are reviewed now. 94.25% are fine and 2.10% have spelling errors, most of them caused by the German spelling reform of 1996. Another 3.05% are hard to pronounce, mainly names and political words.
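To give a feel for how much a 4000-sentence review says about the full corpus, here is a minimal sketch of a 95% Wilson score interval for the share of acceptable sentences. It assumes the reviewed sentences were drawn uniformly at random from the corpus, which this thread does not state explicitly; the function name `wilson_interval` is made up for the example, and 3770 is simply 94.25% of 4000.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a confidence level of roughly 95%.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # 94.25% of the 4000 reviewed sentences were judged fine, i.e. 3770.
    low, high = wilson_interval(3770, 4000)
    print(f"Share of fine sentences: {low:.2%} to {high:.2%}")
```

Under that random-sampling assumption, the share of fine sentences in the whole corpus would plausibly lie between about 93.5% and 94.9%. Because the sample is small relative to the corpus, the size of the corpus itself barely affects this range.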
Hey all,
What is the status on this effort at this point?
For the German import, the review is done, and this is the pull request waiting to be merged or rejected:
We need to get the green light from our team in charge of dataset quality, as well as a legal review, to be fully sure we can use this content under CC0.
The pull request is merged now:
How did you solve the attribution, @nukeador? Are you now generally ready to import more languages?
Now that the wiki-scraper is capable of filtering all kinds of sentence collections, it should be easy to import more languages from this corpus.
Yes, I’m working with @phirework to have this merged and attributed
EDIT: This is now merged.
@stergro can we get an extraction for other languages so we can ping communities to do the QA?
@stergro - we added a note in the README for this specific source: https://github.com/mozilla/voice-web/blob/master/README.md#licensing-and-content-source
For other sources we’ll need to get legal to do a case-by-case review of their licenses to see how we want to handle it, but for now feel free to keep pulling things from Europarl.
If I find the time I will prepare a PR for English and Spanish.
Looks like a good solution to me
Planning one more pass on the English wiki text and then I’ll start on Europarl for English.
Great, I won’t start on any of this before next week, so feel free to be quicker.
Given that things seem settled on this, I went ahead and added a subset of the Dutch sentences, following the guidelines used for German, and adding a couple of other restrictions myself.
The pull request is here: https://github.com/mozilla/voice-web/pull/2643
Merging this is especially relevant for Dutch, because Common Voice has, by now, multiple recordings of all existing sentences already. More diversity would really be nice, and this is a good measure to get there, while the Wikipedia thing gets sorted out.