Future of the Sentence Extractor - Your input is required

mkohler · April 3, 2021, 11:57am

Hi everyone

The Sentence Extractor has been around for some time and was used to extract sentences from Wikipedia for several languages. While this process works for some, it doesn’t for others. As of right now I’m seeing the following issues we might want to address:

It doesn’t work well for certain languages where rust-punkt does not correctly segment sentences due to languages not using periods to separate sentences or due to abbreviations not being recognized correctly.
Contributors interested in doing an extract for their language need to do quite a few steps to get their extract incorporated - which also needs quite some technical knowledge

Given that there are still Wikipedias for languages that haven’t been leveraged, I want to start a discussion on how you would like to see this process working out. Additionally there are other sources this process could be used for.

Would be great to have a discussion here around the following question:

In a perfect world, how would you expect the flow to work to extract sentences from sources like Wikipedia?

Note that in the end we will still need to run the export to make sure the legal requirements are met, but anything before that is up for improvement.

mkohler · April 3, 2021, 12:05pm

Here are my thoughts:

The more technical details we can abstract, the easier it is for somebody to use it
Validation currently happens in a spreadsheet - this could be improved with a common, guided process
We really need to fix the issue with it not working for quite some languages we’d eventually want to work

Picking up older ideas and parts of what @ftyers told me, I’ve created the following diagram:

What this would allow to do:

Easy configuration of rules via GUI without having to run a lot of tools locally with a preview of how the rules apply to a sample set of sentences
Making sure segmentation works for a given language - though with more technical effort needed (not necessarily by the same person as configuring the rules)
Guided review process to keep validation easy and high quality
Guided submission once validation is done (I’m not super happy with still needing a GitHub account in that process)
Once the PR is merged the same process as currently kicks in

I’m a bit torn on the amount of work this would need to get to the finishing line. Is it worth it given that we currently mostly have Wikipedia as a source?

Looking forward to hearing other ideas from all of you!

Codigo_Logo_Programacao_e_Inteligencia_Artificial · April 5, 2021, 7:55am

Since we already have a list of aproved sentences from Wikipedia, I think a good approach would be to train a classifier for valid or invalid sentence, that’s what I would do, it has more room for scaling for multiple languages, moving for a machine learning method seems the logical step. What do you think about that?

stergro · April 5, 2021, 12:40pm

I like your ideas a lot! A graphical interface to create a rule file would make things a lot easier as a first step.

Another thing that I would really love to see is some way to extract more sentences for languages that already did the extraction from Wikipedia. Most versions of Wikipedia are gaining tenths of thousands of new articles every year, so it might be legal to extract only from articles that are newer than the last extraction date, right?

WikiExtractor seems to extract the articles in the order of creation date, so it might be something easy to implement if we are lucky.

mkohler · April 5, 2021, 1:04pm

Generally I’m definitely not against that. However I have not enough knowledge around languages in general and Machine Learning, so I’d rather defer to somebody who knows more (maybe @ftyers has thoughts on this).

mkohler · April 5, 2021, 1:05pm

My understanding is that this should be possible, but Legal would need to confirm before we do that. I’ve tried to capture that in the diagram in the box on the bottom right.

ftyers · April 5, 2021, 2:39pm

Classifiers are good, but we should think about the following:

What is the training data, what is the feature representation
What balance do we want with precision vs. recall
- For example: It’s easy to get high precision by discarding everything outside the alphabet, and maybe that’s all that is needed (if we can only take 3 sents part article anyway)
What is the cost/benefit of implementing a few simple rules (like no sentences longer than 10 tokens, no symbols outside the alphabet + punctuation) vs. implementing a classifier
I think that in general these rules are about as scalable as any classifier because we have community involvement, if they can translate the interface, they can tell us what the alphabet of the language is and what punctuation it uses (or we can use covo).
Anyway, my suggestion is start simple and then add complexity where needed, rather than start complex and get bogged down in computationally expensive models.

Oymate · April 9, 2021, 4:38am

We probably should take at least a week old articles, since then most vandal articles are deleted.

bact · April 15, 2021, 5:52pm

Maybe not directly the extraction, but a step after it.

(or this can be about the extraction as well, see below)

–

Would love to see a prioritisation step to put sentences into the queue for reading/reviewing, to maximise the net contribution to the model.

For example, say we have these six sentences, some sharing their tokens:

Aaa bbbb cccc
Aaa bbbb cccc ddd
Ccc ddd
Aaa bbbb cccc eee
Bbbb cccc eee
Aaa eee fff

Ideally, we like to have all six sentences to be read and reviewed. This target may be get reached eventually, but it takes time.

Prioritisation of the sentences, to be feed into the reading/reviewing queue, could help us get a closer-to-optimum output in a same given amount of labor.

Traditionally, we might go by temporal order, based on timestamp of submission (by the collector):
1, 2, 3, 4, 5, 6
in which we have until the last sentence to be read/reviewed to get all tokens.

We can sort it by sentence length:
2, 4, 1, 5, 6, 3
but for this, the user experience is probably not very good (user will be continuously presented by longer sentences).

We can try to do some diffs, and magically get:
2, 6, 3, 4, 1, 5
where we can get all the tokens in the second sentence, and we gradually get more samples of the same tokens as we get more sentences.

–

I’m not entirely sure about the proposed pipeline, but this prioritisation/reordering could happen towards the very end when new sentences get exported.

It might be get more complex if it has to also consider the previously extracted sentences (reviewed and to be reviewed).

–

This prioritisation could be in the end related to extraction.

As we trying to extract limited number of sentences (3) from a single article, we should trying to extract the sentence that contain tokens/combination of tokens that the current database don’t have.

bact · April 18, 2021, 5:06am

Some previous discussions, for reference:

github.com/common-voice/cv-sentence-extractor

Improve sentence separation

opened 05:27PM - 24 Apr 19 UTC

closed 11:45AM - 18 Jul 21 UTC

dabinat

bug P1 extract-improvements punkt-issue

There are a lot of partial sentences such as: > is applied to species he desc…ribed. > a fully documented genealogy e.g. > are verbs. These seem to occur commonly after an abbreviation with a full stop / period after it or something like "i.e." or "e.g.". It seems that the script is interpreting the full stops as the end of the sentence. A way to improve this would be to only split up the sentence if the next character after the full stop is a capital letter.

github.com/common-voice/cv-sentence-extractor

Adding Thai rules for CV Sentence Extractor

common-voice:main ← thainetizen:main

opened 03:45AM - 15 Apr 21 UTC

bact

+165 -0

`th.toml`: - `other_patterns` borrowing from: https://github.com/common-voice/…sentence-collector/blob/main/server/lib/validation/languages/th.js (BEGIN_REGEX, END_REGEX, STRUCTURE_REGEX, and ABBREVIATION_REGEX with few adjustments) - `replacements` borrowing from: https://github.com/common-voice/sentence-collector/blob/main/server/lib/cleanup/languages/th.js (with some adjustments) - `min_word_count` and `max_word_count` are set on the basis of treating "word" as "a group of character between two whitespaces/punctuations", since currently there's no Thai word tokenization in the extractor. This will close #133 ## How many sentences did you get at the end? 478 ## How did you create the blocklist file? Since the current tokenizer does not work well with a language using no space as a word delimiter, cvtools seems doesn't work, so I haven't create one. ## Review / Error ratio (from 184 samples) Category | % -- | -- OK | 88 A: Spelling is not correct | 1 B: Grammar is not correct | 0 C: It's not easily speakable (including uncommon non-native words)| 1 D: Other | 10 "D" are mostly sentences with a "dangling word" in the beginning (it is meant to be a last word in the previous sentence). Since the total number of the sentences I have is just below 500, and the suggested amount of random sample is "100-500", I'm not sure if the amount of sentences I have is just unexpectedly low or not. Would like to clarify this before I ask more people for review. (I may have to "relax" the rules, but still not sure if this related to the way the punkt sentence tokenizer works or not). The extracted sentences are here: https://docs.google.com/spreadsheets/d/1pKBH_YQiO9ZdXIduvrb37HvCLlKBt8mGeDCpX8e8dT4/edit?usp=sharing ## Questions Does the original number of articles in Wikipedia also affect the number of extracted output as well? Tried to extract all the articles, without rules applying, with this command: ```sh cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt ``` Got this ```sh $ wc -l wiki.th.* 1985699 wiki.th.all.txt 478 wiki.th.txt ``` We actually have a lot of lines extracted in `wiki.th.all.txt` (1,314,274 lines after blank lines removed), but looks like these "sentences" are tend to be very long. In fact, a lot of lines contains more than one sentence (can be a whole paragraph). And the longer the line/sentence is, the more likely that it will got hit by one of the disallowing rules. Few sample lines from `wiki.th.all.txt` (applying no rules): - ดาราศาสตร์เป็นหนึ่งในสาขาของวิทยาศาสตร์ที่เก่าแก่ที่สุด นักดาราศาสตร์ในวัฒนธรรมโบราณสังเกตการณ์ดวงดาวบนท้องฟ้าในเวลากลางคืน และวัตถุทางดาราศาสตร์หลายอย่างก็ได้ถูกค้นพบเรื่อยมาตามยุคสมัย อย่างไรก็ตาม กล้องโทรทรรศน์เป็นสิ่งประดิษฐ์ที่จำเป็นก่อนที่จะมีการพัฒนามาเป็นวิทยาศาสตร์สมัยใหม่ ตั้งแต่อดีตกาล ดาราศาสตร์ประกอบไปด้วสาขาที่หลากหลายเช่น การวัดตำแหน่งดาว การเดินเรือดาราศาสตร์ ดาราศาสตร์เชิงสังเกตการณ์ การสร้างปฏิทิน และรวมทั้งโหราศาสตร์ แต่ดาราศาสตร์ทุกวันนี้ถูกจัดว่ามีความหมายเหมือนกับฟิสิกส์ดาราศาสตร์ ตั้งแต่คริสต์ศตวรรษที่ 20 เป็นต้นมา ดาราศาสตร์ได้แบ่งออกเป็นสองสาขาได้แก่ ดาราศาสตร์เชิงสังเกตการณ์ และดาราศาสตร์เชิงทฤษฎี ดาราศาสตร์เชิงสังเกตการณ์จะให้ความสำคัญไปที่การเก็บและการวิเคราะห์ข้อมูล โดยการใช้ความรู้ทางกายภาพเบื้องต้นเป็นหลัก ส่วนดาราศาสตร์เชิงทฤษฎีให้ความสำคัญไปที่การพัฒนาคอมพิวเตอร์หรือแบบจำลองเชิงวิเคราะห์ เพื่ออธิบายวัตถุท้องฟ้าและปรากฏการณ์ต่าง ๆ ทั้งสองสาขานี้เป็นองค์ประกอบซึ่งกันและกัน กล่าวคือ ดาราศาสตร์เชิงทฤษฎีใช้อธิบายผลจากการสังเกตการณ์ และดาราศาสตร์เชิงสังเกตการณ์ใช้ในการรับรองผลจากทางทฤษฎี - เมื่อสังคมมีวิวัฒนาการขึ้นในดินแดนต่าง ๆ การสังเกตการณ์ทางดาราศาสตร์ก็ซับซ้อนมากขึ้น โดยเฉพาะอย่างยิ่งใน เมโสโปเตเมีย กรีก จีน อียิปต์ อินเดีย และ มายา เริ่มมีแนวคิดเกี่ยวกับความสัมพันธ์ของธรรมชาติแห่งจักรวาลกว้างขวางขึ้น ผลการศึกษาดาราศาสตร์ในยุคแรก ๆ จะเป็นการบันทึกแผนที่ตำแหน่งของดวงดาวต่าง ๆ อันเป็นศาสตร์ที่ปัจจุบันเรียกกันว่า การวัดตำแหน่งดาว (astrometry) ผลจากการเฝ้าสังเกตการณ์ทำให้แนวคิดเกี่ยวกับการเคลื่อนที่ของดวงดาวต่าง ๆ เริ่มก่อตัวเป็นรูปร่างขึ้น ธรรมชาติการเคลื่อนที่ของดวงอาทิตย์ ดวงจันทร์ และโลก นำไปสู่แนวคิดเชิงปรัชญาเพื่อพยายามอธิบายปรากฏการณ์เหล่านั้น ความเชื่อดั้งเดิมคือโลกเป็นศูนย์กลางของจักรวาล โดยมีดวงอาทิตย์ ดวงจันทร์ และดวงดาวต่าง ๆ เคลื่อนที่ไปโดยรอบ แนวคิดนี้เรียกว่า แบบจำลองแบบโลกเป็นศูนย์กลางจักรวาล (geocentric model) - เคปเลอร์ได้คิดค้นระบบแบบใหม่ขึ้นโดยปรับปรุงจากแบบจำลองเดิมของโคเปอร์นิคัส ทำให้รายละเอียดการโคจรต่าง ๆ ของดาวเคราะห์และดวงอาทิตย์ที่ศูนย์กลางสมบูรณ์ถูกต้องมากยิ่งขึ้น แต่เคปเลอร์ก็ไม่ประสบความสำเร็จในการนำเสนอทฤษฎีนี้เนื่องจากกฎหมายในยุคสมัยนั้น จนกระทั่งต่อมาถึงยุคสมัยของเซอร์ ไอแซค นิวตัน ผู้คิดค้นหลักกลศาสตร์ท้องฟ้าและกฎแรงโน้มถ่วงซึ่งสามารถอธิบายการเคลื่อนที่ของดาวเคราะห์ได้อย่างสมบูรณ์ นิวตันยังได้คิดค้นกล้องโทรทรรศน์แบบสะท้อนแสงขึ้นด้วย - ไม่ควรสับสนระหว่างดาราศาสตร์โบราณกับโหราศาสตร์ ซึ่งเป็นความเชื่อที่นำเอาเหตุการณ์และพฤติกรรมของมนุษย์ไปเกี่ยวโยงกับตำแหน่งของวัตถุท้องฟ้า แม้ว่าทั้งดาราศาสตร์และโหราศาสตร์เกิดมาจากจุดร่วมเดียวกัน และมีส่วนหนึ่งของวิธีการศึกษาที่เหมือนกัน เช่นการบันทึกตำแหน่งดาว (ephemeris) แต่ทั้งสองอย่างก็แตกต่างกัน I guess if we can make the lines shorter, we can get more extracted sentences in `wiki.th.txt` Need some suggestions here. Thank you.

Aliaksandr · May 27, 2021, 6:27pm

Hi!

I’ve got another proposal.

During our review of extract from belarusian wikipedia we found that it might be useful to have another rule that will control mean word length of a sentence.

Consider following sentences that were extracted for belarusian (all have word count < max_word_count == 14):

sentence	#words	mean word length
Зазнаў уплывы антычнага дойлідства, венецыянскага барочнага тэатральна-дэкарацыйнага мастацтва.	8	10.75
Асвятляюцца навіны вытворчасці, грамадска-палітычнага, сацыяльна-культурнага і спартыўнага жыцця.	8	10.875
Працаваў таксама ў галіне манументальна-дэкаратыўнай керамікі, габелена, тэатральна-дэкарацыйнага мастацтва, кніжнай ілюстрацыі.	11	10.36

They all are pretty hard to pronounce because they consist of complicated words. Of course we want our model to recognize them but such sentences may cause troubles when recording.

Regular sentences have lower mean word lenght:

sentence	#words	mean word length
У цэнтры другога паверха быў зроблены балкон, аздоблены каванай агароджай.	10	6.3
Да гэтага ж часу адносіцца пачатак яго выкладчыцкай дзейнасці.	9	5.89
Як змяніўся горад за гэты час?	6	4

I am attaching a link to the image with exact distribution of mean word length that has been uploaded to the Pull Request comment in cv-sentence-extractor git repo.

I couldn’t upload image to this message because of some credential issue. (The error message: “missing credentials, provide credentials with one of the following options:”. )

So it seems, this feature might be very useful for further wiki extracts.
What do you think?

mkohler · May 28, 2021, 6:52pm

Thanks. I wonder if anyone would have opinions on how that could possibly work with languages that are not so similar and possibly don’t have words, or don’t use spaces to separate. Would be interesting to see if there is an approach that would work for other languages as well.

Topic		Replies	Views
[Technical feedback needed] Wikipedia extractor script beta Common Voice sentence-collection , feedback	76	8406	July 1, 2020
Question about CV Sentence Extractor quality and your experience Common Voice	18	1554	August 30, 2023
Bulk sentences submission from Wikipedia Common Voice sentence-collection	4	585	August 12, 2024
About the new English Sentences Common Voice feedback , issue	37	3316	May 31, 2019
We want your feedback: Improving the sentence collection Common Voice sentence-collection , feedback	39	8866	January 9, 2019

Future of the Sentence Extractor - Your input is required

Related topics