nukeador
(Rubén Martín [❌ taking a break from Mozilla])
July 7, 2020, 12:03pm
4
mytmpaccount2015:
I’ll also take a look at the sentence extractor – maybe it wouldn’t require much effort to scrape the Belarusian Wikipedia. I’m concerned about the license though – iirc, Wikipedia’s textual content is CC-BY-SA. If that was discussed before, I would appreciate if you point me at the relevant discussion.
We have special legal approval for it, and it’s the first and main place we direct people to for sentences (it is the most productive in terms of time investment). See the conditions and process here:
What is the Sentence Extractor
Common Voice is Mozilla’s initiative to help teach machines how real people speak. For this we need to collect sentences that people can read aloud on the website. Individual sentences can be submitted through the Sentence Collector. This can only scale so far, so we also use the Sentence Extractor (formerly Wiki Scraper) to extract sentences from other sources.
Currently the only implemented source is Wikipedia. We are allowed to export a maximum of 3 sentences p…