Future of the Sentence Extractor - Your input is required

Hi everyone,

The Sentence Extractor has been around for some time and has been used to extract sentences from Wikipedia for several languages. While this process works for some of them, it doesn’t for others. Right now I see the following issues we might want to address:

  • It doesn’t work well for certain languages where rust-punkt does not segment sentences correctly, either because the language doesn’t use periods to separate sentences or because abbreviations aren’t recognized correctly.
  • Contributors interested in doing an extract for their language need to go through quite a few steps to get their extract incorporated - which also requires quite some technical knowledge

Given that there are still Wikipedias for languages that haven’t been leveraged, I want to start a discussion on how you would like this process to work. Additionally, there are other sources this process could be used for.

It would be great to have a discussion here around the following question:

In a perfect world, how would you expect the flow to work to extract sentences from sources like Wikipedia?

Note that in the end we will still need to run the export to make sure the legal requirements are met, but anything before that is up for improvement.


Here are my thoughts:

  • The more technical details we can abstract, the easier it is for somebody to use it
  • Validation currently happens in a spreadsheet - this could be improved with a common, guided process
  • We really need to fix the issue of it not working for quite a few languages we’d eventually want to support

Picking up older ideas and parts of what @ftyers told me, I’ve created the following diagram:

What this would allow us to do:

  • Easy configuration of rules via GUI without having to run a lot of tools locally with a preview of how the rules apply to a sample set of sentences
  • Making sure segmentation works for a given language - though with more technical effort needed (not necessarily by the same person who configures the rules)
  • Guided review process to keep validation easy and high quality
  • Guided submission once validation is done (I’m not super happy with still needing a GitHub account in that process)
  • Once the PR is merged, the same process as today kicks in

I’m a bit torn on the amount of work this would need to get over the finish line. Is it worth it, given that we currently mostly have Wikipedia as a source?

Looking forward to hearing other ideas from all of you!


Since we already have a list of approved sentences from Wikipedia, I think a good approach would be to train a classifier for valid vs. invalid sentences - that’s what I would do. It has more room to scale across multiple languages, so moving to a machine learning method seems like the logical step. What do you think about that?
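To make this concrete, here is a minimal sketch of what such a valid/invalid classifier could look like - a hand-rolled Naive Bayes over character trigrams, which is language-agnostic since it needs no tokenizer. Everything here (the toy training data, the labels, the feature choice) is a hypothetical illustration, not part of the current pipeline:

```python
import math
from collections import Counter

def char_trigrams(text):
    """Character trigrams with space padding - simple, tokenizer-free features."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class TrigramNaiveBayes:
    """Tiny Naive Bayes over character trigrams with Laplace smoothing."""

    def fit(self, samples):
        # samples: list of (sentence, label) pairs
        self.class_counts = Counter()
        self.gram_counts = {}
        self.vocab = set()
        for text, label in samples:
            self.class_counts[label] += 1
            counts = self.gram_counts.setdefault(label, Counter())
            for gram in char_trigrams(text):
                counts[gram] += 1
                self.vocab.add(gram)
        self.totals = {l: sum(c.values()) for l, c in self.gram_counts.items()}
        self.n_samples = sum(self.class_counts.values())
        return self

    def predict(self, text):
        best_label, best_logp = None, float("-inf")
        v = len(self.vocab)
        for label in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            logp = math.log(self.class_counts[label] / self.n_samples)
            for gram in char_trigrams(text):
                seen = self.gram_counts[label][gram]
                logp += math.log((seen + 1) / (self.totals[label] + v))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Toy demo with made-up data - real training data would be the already
# reviewed (approved/rejected) sentences.
nb = TrigramNaiveBayes().fit([
    ("the cat sat on the mat", "valid"),
    ("a quick brown fox jumps", "valid"),
    ("1234567890 1234567890", "invalid"),
    ("== ref 1234567890 ==", "invalid"),
])
print(nb.predict("the mat sat on the cat"))  # valid
```

In practice an off-the-shelf library would replace this toy model, but the shape of the approach - features from approved/rejected review data, a per-sentence score - would stay the same.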

I like your ideas a lot! A graphical interface to create a rule file would make things a lot easier as a first step.

Another thing that I would really love to see is some way to extract more sentences for languages that have already done the extraction from Wikipedia. Most versions of Wikipedia gain tens of thousands of new articles every year, so it might be legal to extract only from articles that are newer than the last extraction date, right?

WikiExtractor seems to extract the articles in creation-date order, so if we are lucky it might be easy to implement.
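On the "newer than the last extraction date" idea: the raw MediaWiki XML dumps (as opposed to WikiExtractor's plain-text output) carry a `<timestamp>` per revision, so one conceivable pre-filter would keep only pages whose revision is newer than a cutoff. A sketch under that assumption - note this timestamp is the last-edit time, not the creation date, so on its own it doesn't prove an article is new:

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Minimal standalone sample mimicking the structure of a MediaWiki XML dump
# (namespaces omitted for brevity; real dumps use the MediaWiki export namespace).
DUMP = """
<mediawiki>
  <page>
    <title>Old article</title>
    <revision><timestamp>2019-05-01T12:00:00Z</timestamp></revision>
  </page>
  <page>
    <title>New article</title>
    <revision><timestamp>2021-03-15T08:30:00Z</timestamp></revision>
  </page>
</mediawiki>
"""

def pages_newer_than(dump_xml, cutoff):
    """Yield titles of pages whose revision timestamp is newer than the cutoff."""
    root = ET.fromstring(dump_xml)
    for page in root.iter("page"):
        ts = page.find("./revision/timestamp").text
        when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if when > cutoff:
            yield page.find("title").text

cutoff = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(list(pages_newer_than(DUMP, cutoff)))  # ['New article']
```

Whether date-filtered re-extraction satisfies the legal requirements is exactly the question Legal would have to confirm first.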


Generally I’m definitely not against that. However, I don’t have enough knowledge about languages in general or about machine learning, so I’d rather defer to somebody who knows more (maybe @ftyers has thoughts on this).

My understanding is that this should be possible, but Legal would need to confirm before we do that. I’ve tried to capture that in the diagram in the box on the bottom right.


Classifiers are good, but we should think about the following:

  • What is the training data, and what is the feature representation?
  • What balance do we want with precision vs. recall
    • For example: It’s easy to get high precision by discarding everything outside the alphabet, and maybe that’s all that is needed (if we can only take 3 sentences per article anyway)
  • What is the cost/benefit of implementing a few simple rules (like no sentences longer than 10 tokens, no symbols outside the alphabet + punctuation) vs. implementing a classifier
  • I think that in general these rules are about as scalable as any classifier, because we have community involvement: if contributors can translate the interface, they can tell us what the alphabet of the language is and what punctuation it uses (or we can use covo).
  • Anyway, my suggestion is to start simple and then add complexity where needed, rather than start complex and get bogged down in computationally expensive models.
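To illustrate the cost/benefit point: rules like "no more than N tokens" and "no symbols outside the alphabet plus punctuation" amount to a few lines of code. A sketch - the threshold, punctuation set, and alphabet here are made-up illustrations, not the extractor's actual rules:

```python
import string

def passes_rules(sentence, alphabet, max_tokens=10):
    """High-precision filter: token limit plus 'nothing outside the
    alphabet + punctuation' check, as suggested above."""
    tokens = sentence.split()
    if not tokens or len(tokens) > max_tokens:
        return False
    # Punctuation set is an illustrative guess; a real rule file would
    # define this per language.
    allowed = set(alphabet) | set(" ,.!?;:'-")
    return all(ch in allowed for ch in sentence.lower())

latin = string.ascii_lowercase
print(passes_rules("The cat sat on the mat.", latin))   # True
print(passes_rules("E = mc^2 changed physics", latin))  # False: '=', '^', '2'
```

Since the alphabet is the only language-specific input, this is exactly the kind of rule a community member could supply without touching any code.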

We should probably only take articles that are at least a week old, since by then most vandalism articles have been deleted.


Maybe not directly the extraction, but a step after it.

(or this can be about the extraction as well, see below)

Would love to see a prioritisation step to put sentences into the queue for reading/reviewing, to maximise the net contribution to the model.

For example, say we have these six sentences, some sharing their tokens:

  1. Aaa bbbb cccc
  2. Aaa bbbb cccc ddd
  3. Ccc ddd
  4. Aaa bbbb cccc eee
  5. Bbbb cccc eee
  6. Aaa eee fff

Ideally, we would like all six sentences to be read and reviewed. This target may eventually be reached, but it takes time.

Prioritisation of the sentences to be fed into the reading/reviewing queue could help us get closer-to-optimal output for the same amount of labor.

Traditionally, we might go in temporal order, based on the timestamp of submission (by the collector):
1, 2, 3, 4, 5, 6
in which case we don’t cover all tokens until the very last sentence is read/reviewed.

We could sort by sentence length:
2, 4, 1, 5, 6, 3
but the user experience would probably not be very good (users would continuously be presented with the longest sentences first).

We could try to do some diffs and magically get:
2, 6, 3, 4, 1, 5
where we cover all the tokens within the first few sentences, and then gradually get more samples of the same tokens as more sentences are read.
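The "diff" ordering can be approximated with a greedy token-coverage pass: at each step, pick the sentence that adds the most not-yet-seen tokens. A sketch - ties are broken by original position here, so the tail differs slightly from the hand-picked ordering above:

```python
def coverage_order(sentences):
    """Greedy ordering: each step picks the sentence adding the most new tokens,
    breaking ties in favor of earlier sentences."""
    token_sets = [set(s.lower().split()) for s in sentences]
    remaining = list(range(len(sentences)))
    seen, order = set(), []
    while remaining:
        best = max(remaining, key=lambda i: (len(token_sets[i] - seen), -i))
        order.append(best)
        seen |= token_sets[best]
        remaining.remove(best)
    return order

sentences = [
    "Aaa bbbb cccc",
    "Aaa bbbb cccc ddd",
    "Ccc ddd",
    "Aaa bbbb cccc eee",
    "Bbbb cccc eee",
    "Aaa eee fff",
]
# 1-based, to match the numbering in the post
print([i + 1 for i in coverage_order(sentences)])  # [2, 6, 3, 1, 4, 5]
```

All tokens are covered after the first three picks (2, 6, 3), matching the hand-computed ordering; the remaining sentences add no new tokens, so their relative order is a tie-breaking policy choice.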

I’m not entirely sure about the proposed pipeline, but this prioritisation/reordering could happen towards the very end when new sentences get exported.

It might get more complex if it also has to consider the previously extracted sentences (both reviewed and still to be reviewed).

This prioritisation could be in the end related to extraction.

Since we only extract a limited number of sentences (3) from a single article, we should try to extract the sentences that contain tokens, or combinations of tokens, that the current database doesn’t have.


Some previous discussions, for reference:


I’ve got another proposal.

During our review of the extract from the Belarusian Wikipedia, we found that it might be useful to have another rule that controls the mean word length of a sentence.

Consider the following sentences that were extracted for Belarusian (all have word count < max_word_count == 14):

  • Зазнаў уплывы антычнага дойлідства, венецыянскага барочнага тэатральна-дэкарацыйнага мастацтва. (8 words, mean word length 10.75)
  • Асвятляюцца навіны вытворчасці, грамадска-палітычнага, сацыяльна-культурнага і спартыўнага жыцця. (8 words, mean word length 10.875)
  • Працаваў таксама ў галіне манументальна-дэкаратыўнай керамікі, габелена, тэатральна-дэкарацыйнага мастацтва, кніжнай ілюстрацыі. (11 words, mean word length 10.36)

They are all pretty hard to pronounce because they consist of complicated words. Of course we want our model to recognize them, but such sentences may cause trouble when recording.

Regular sentences have a lower mean word length:

  • У цэнтры другога паверха быў зроблены балкон, аздоблены каванай агароджай. (10 words, mean word length 6.3)
  • Да гэтага ж часу адносіцца пачатак яго выкладчыцкай дзейнасці. (9 words, mean word length 5.89)
  • Як змяніўся горад за гэты час? (6 words, mean word length 4)
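Such a rule is cheap to compute. A sketch of the check, using sentences from the examples above - the cutoff value of 8 is a made-up illustration, not a tuned threshold:

```python
def mean_word_length(sentence):
    """Average length of the words after stripping surrounding punctuation
    (hyphens inside compound words still count toward length)."""
    words = [w.strip(".,!?;:«»") for w in sentence.split()]
    words = [w for w in words if w]
    return sum(len(w) for w in words) / len(words)

def passes_mean_length(sentence, max_mean=8.0):
    return mean_word_length(sentence) <= max_mean

hard = ("Зазнаў уплывы антычнага дойлідства, венецыянскага "
        "барочнага тэатральна-дэкарацыйнага мастацтва.")
easy = "Як змяніўся горад за гэты час?"
print(round(mean_word_length(hard), 2))  # 10.75, matching the value above
print(round(mean_word_length(easy), 2))  # 4.0
```

A real implementation would live alongside the existing per-language rules (like `max_word_count`), with the threshold chosen from the kind of distribution plot mentioned below.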

I am attaching a link to an image with the exact distribution of mean word lengths, which has been uploaded as a Pull Request comment in the cv-sentence-extractor Git repo.

I couldn’t upload image to this message because of some credential issue. (The error message: “missing credentials, provide credentials with one of the following options:”. )

So it seems this feature might be very useful for further wiki extracts.
What do you think?


Thanks. I wonder if anyone has opinions on how that could work for languages that are not so similar and possibly don’t have words, or don’t use spaces to separate them. It would be interesting to see if there is an approach that would work for those languages as well.