Sentence Extractor - Current Status and Workflow Summary

Currently we can’t re-run exports for these languages. One idea to look into is to re-run it for articles that were created after the export date.

Hi Michael,

I was wondering how that plays out with the single-sentence record limit feature that was announced. There’s some math here suggesting that a language needs 2,000+ hours of recordings; at ~4 seconds per clip, that’s 7.2M seconds, or at least 1.8M clips, each of which now has to use a unique sentence.
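Just to double-check my own arithmetic, here’s a quick sketch in Rust. The 2,000-hour target and 4 seconds/clip are the assumptions from above, not official numbers:

```rust
// Back-of-the-envelope check: how many unique sentences are needed
// to cover a target number of recording hours at a given clip length?
fn required_unique_sentences(target_hours: u64, secs_per_clip: u64) -> u64 {
    let total_seconds = target_hours * 3600;
    total_seconds / secs_per_clip
}

fn main() {
    // 2,000 hours at ~4 s/clip (assumed figures)
    let n = required_unique_sentences(2_000, 4);
    println!("unique sentences needed: {}", n); // prints 1800000
}
```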

Am I thinking about this problem correctly? If so, it means that this tool will become much more critical.

I was looking at Spanish as an example, and I see that there are 10,836 sentences here. Are those all the sentences that were added to the database by the sentence-extractor? Is there a way to see how many sentences there are in the database in total (including those added manually through the sentence-collector)?

I see the export date for Spanish is 2020-06-11. When you said:

One idea to look into is to re-run it for articles that were created after the export date.

Does that mean that we can’t extract any more sentences from Wikipedia articles that were created before that date?

I saw there’s a cap of 3 sentences per article (here). Assuming there are 30k sentences in the Spanish database (10k submitted through the extractor + maybe 20k through the manual collector), we still need 1.77M (1.8M - 30k) sentences. At 3 per article, that’s 590k articles. I found this, and it looks like Spanish Wikipedia added about 100k new articles from Dec 2017 to Dec 2018 (the latest data available there). We would need 5+ years of new articles to have enough data lol.
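Putting those numbers together (the 30k existing-sentence figure and the 100k articles/year rate are both my guesses from above):

```rust
// How many new Wikipedia articles (and years of article growth) would be
// needed to close the sentence gap, given the assumptions in this post.
fn articles_needed(target: u64, existing: u64, per_article: u64) -> u64 {
    (target - existing) / per_article
}

fn main() {
    let target_sentences: u64 = 1_800_000;     // from the 2,000-hour estimate
    let existing_sentences: u64 = 30_000;      // 10k extractor + ~20k collector (guess)
    let sentences_per_article: u64 = 3;        // extractor's per-article cap
    let new_articles_per_year: f64 = 100_000.0; // Spanish Wikipedia, Dec 2017 - Dec 2018

    let articles = articles_needed(target_sentences, existing_sentences, sentences_per_article);
    let years = articles as f64 / new_articles_per_year;
    println!("{} articles, ~{:.1} years of new articles", articles, years);
}
```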

Sorry for all the questions. I really like the idea of the project. I’ve been using Rust at work for a while now and figured I could contribute here. I’m trying to understand where this tool fits in the broader project, and whether it’s something worth investing more time in.

Thanks!