> Currently we can’t re-run exports for these languages. One idea to look into is to re-run it for articles that were created after the export date.
Hi Michael,
I was wondering how that plays out with the single-recording-per-sentence limit that was announced. Some math here suggests that a language needs 2,000+ hours of recordings; at ~4 seconds per clip, that works out to at least 1.8M clips, and therefore 1.8M sentences that now have to be unique.
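To sanity-check that number, here's the arithmetic as a quick Rust sketch (the 4 s/clip average is my own assumption):

```rust
fn main() {
    let target_hours = 2_000.0_f64; // target hours of validated speech
    let secs_per_clip = 4.0_f64;    // assumed average clip length
    // hours -> seconds -> clips; one unique sentence per clip
    let sentences = target_hours * 3_600.0 / secs_per_clip;
    println!("unique sentences needed: {}", sentences); // 1800000
}
```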
Am I thinking about this problem correctly? If so, it means that this tool will become much more critical.
I was looking at Spanish as an example, and I see there are 10,836 sentences here. Are those all the sentences that were added to the database using the sentence-extractor? Is there a way to see how many sentences there are in the database in total (including those that were manually added using the sentence-collector)?
I see the export date for Spanish is 2020-06-11. When you said:
> One idea to look into is to re-run it for articles that were created after the export date.
Does that mean that we can’t extract any more sentences from Wikipedia articles that were created before that date?
I saw there’s a cap of 3 sentences per article (here). Assuming there are 30k sentences in the Spanish database (10k submitted through the extractor + 20k? through the manual collector), we still need 1.77M (1.8M - 30k) sentences. At 3 per article, that’s 590k articles. I found this, and it looks like Spanish Wikipedia added 100k new articles from Dec 2017 to Dec 2018 (the latest data available there). We would need almost 6 years of new articles to have enough data lol.
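Again as a rough Rust sketch (the 30k current total and the 100k articles/year growth rate are my assumptions from above):

```rust
fn main() {
    let needed: u64 = 1_800_000;  // unique sentences for 2000 h at 4 s/clip
    let existing: u64 = 30_000;   // assumed current total (extractor + collector)
    let cap_per_article: u64 = 3; // extractor's per-article sentence cap
    let remaining = needed - existing;          // 1,770,000
    let articles = remaining / cap_per_article; // 590,000
    let growth_per_year = 100_000.0_f64;        // es.wikipedia, Dec 2017 to Dec 2018
    println!(
        "articles needed: {}, years of growth: {:.1}",
        articles,
        articles as f64 / growth_per_year
    ); // articles needed: 590000, years of growth: 5.9
}
```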
Sorry for all the questions. I really like the idea of the project. I’ve been using Rust at work for a while now and figured I could contribute here. I’m trying to understand where this tool fits in the broader project, and whether it’s something worth investing more time in.
Thanks!