Cleaning up sentence corpus

I just updated the script to convert curly apostrophes to straight ones, and there's now also a --min-frequency option to only show words with at least x matches.
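
Roughly speaking, it boils down to something like this (a simplified illustration of what --min-frequency does, not the actual code):

import re
from collections import Counter

def word_frequencies(sentences, min_frequency=1):
    # Normalise curly apostrophes to straight ones, then report
    # words that appear at least min_frequency times.
    counts = Counter()
    for sentence in sentences:
        sentence = sentence.replace("\u2019", "'")  # ’ -> '
        counts.update(re.findall(r"\w+(?:'\w+)*", sentence.lower()))
    return {word: n for word, n in counts.items() if n >= min_frequency}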


Oh wow, with the new version I’m getting

433102 words with no repetitions.
112300 words with 2 repetitions.

That’s a lot, and it would mean removing at least 657702 sentences. We should probably think about whether there is a better way to do this; it feels like we are catching flies with a shotgun.

Searching for letter combinations that can’t occur together in my language, plus maybe filtering out words longer than X characters, would probably be a better approach (at least for Spanish).

Yes, I think that’s the best way. Low-frequency long words are probably names of places or people, especially if they begin with a capital letter.
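
Something along these lines could work as a starting point. The forbidden letter pairs below are only hypothetical placeholders (a real list would need a native speaker), and the length/frequency thresholds are made up too:

# Hypothetical letter pairs assumed never to occur in Spanish words.
FORBIDDEN_PAIRS = ("kk", "wx", "qk", "zx")
MAX_WORD_LENGTH = 15   # "words longer than X"
RARE_THRESHOLD = 5

def looks_foreign(word, frequency):
    lower = word.lower()
    if any(pair in lower for pair in FORBIDDEN_PAIRS):
        return True
    # Long, rare, capitalised words are most likely proper nouns.
    if len(word) > MAX_WORD_LENGTH and frequency < RARE_THRESHOLD and word[0].isupper():
        return True
    return False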

Added some extra arguments: --max-frequency and --show-words-only. So you can now easily create a blacklist of the least popular words.

Cool. Which script does the letter filtering? I’d like to compare results.

BTW, these are the full exports from Wikipedia in Spanish and German, in case you want to play with them.

It’s all in the same repo (sentence_validator.py). That one’s very English-specific, though, so I’d recommend forking it for other languages.

Although not a problem with English Wikipedia, other sources quite often use strange characters to represent quotation marks, presumably to make the text look pretty. In case it’s of any use, here’s what I do to clean up:

import re  # (assumes atext already holds the raw text)

# Clean up the base text, and simplify some of the weird quote marks
atext = re.sub(r'\s+', ' ', atext).strip()  # collapse runs of whitespace (spaces, linefeeds, tabs) into single spaces
atext = re.sub(r'[<>+*#@^/]', '', atext)  # remove other non-allowed symbols
atext = re.sub(u201b, u2018, atext)
atext = re.sub(u201f, u201d, atext)
atext = re.sub(uff02, u0022, atext)
atext = re.sub(u301d, u201c, atext)
atext = re.sub(u301e, u201d, atext)
atext = re.sub("n’t", "n't", atext)  # clean up e.g. "don't" where the 'apostrophe' is actually a RIGHT SINGLE QUOTATION MARK

where

# Main symbols
u0027 = '\u0027'  # ' APOSTROPHE [upright; can be used as single quote]
u0022 = '\u0022'  # " QUOTATION MARK [upright]
u2018 = '\u2018'  # ‘ LEFT SINGLE QUOTATION MARK
u2019 = '\u2019'  # ’ RIGHT SINGLE QUOTATION MARK  [sometimes used as an apostrophe]
u201c = '\u201c'  # “ LEFT DOUBLE QUOTATION MARK
u201d = '\u201d'  # ” RIGHT DOUBLE QUOTATION MARK

# Substituted before use
u201b = '\u201b'  # ‛ SINGLE HIGH-REVERSED-9 QUOTATION MARK
u201f = '\u201f'  # ‟ DOUBLE HIGH-REVERSED-9 QUOTATION MARK
uff02 = '\uff02'  # ＂ FULLWIDTH QUOTATION MARK
u301d = '\u301d'  # 〝 REVERSED DOUBLE PRIME QUOTATION MARK
u301e = '\u301e'  # 〞 DOUBLE PRIME QUOTATION MARK

One issue that’s not easy to solve is the use of the Right Single Quotation Mark as an apostrophe, and vice versa. A single sentence may, for example, contain two Right Single Quotation Marks, and it would take some work to sort out whether those are unmatched quotation marks (invalid) or apostrophes (valid).
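
A rough heuristic (only an approximation, and it will misclassify some cases) is to treat a Right Single Quotation Mark with a letter on each side as an apostrophe and anything else as a quote:

import re

RSQM = '\u2019'  # RIGHT SINGLE QUOTATION MARK

def count_rsqm_uses(sentence):
    # An RSQM with a letter on each side is probably an apostrophe;
    # anything else is treated as a (possibly unmatched) quote.
    apostrophes = len(re.findall(r'\w' + RSQM + r'\w', sentence))
    quotes = sentence.count(RSQM) - apostrophes
    return apostrophes, quotes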

@carlfm01 check this out for the work we are currently doing for Spanish 🙂

I would like to mention that we have been testing spaCy on the Spanish sentences, to get a smarter vocabulary list that reduces our sample far less (226K instead of 550K words with < 5 repetitions).

I don’t know if this is useful for all languages or if it’s worth integrating directly into the scraper script.
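
For reference, the basic idea is just checking whether a word is in the model’s vocabulary. A minimal sketch, assuming the medium Spanish model (es_core_news_md) is installed:

import spacy

nlp = spacy.load("es_core_news_md")  # python -m spacy download es_core_news_md

def out_of_vocab(word):
    # Lexemes the model has no knowledge of are flagged as
    # out-of-vocabulary; those are good blacklist candidates.
    return nlp.vocab[word].is_oov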

Interesting… I will look into that.

To be honest, I feel that all of these methods should eventually be part of the wiki scraper. The problem with doing it afterwards is that all filtered sentences are removed entirely, whereas the scraper can easily replace a bad sentence with a different one, so the quantity ends up the same and there’s no need to worry about over-filtering.

Do you think there will be an opportunity in future to re-run the wiki scraper for English to make up for the ones that were filtered out? The challenge would be making sure we’re not adding more than three sentences from any page, but I guess you could do a pass across the page first and count how many sentences are matched from the previous wiki dump.

That’s unknown to me, since I don’t know if we have the ability to extract the exact same valid sentences as we did before. @gregor, @fiji and @mkohler know the code better.

Well, it’s supposed to be random, so I don’t know if that will be possible. You’d have to build up a list of all sentences on the page, then count how many match in the previous wiki dump. If it’s three or more, move on to another page; otherwise extract whatever will take the total sentences up to three.

It’s probably not going to be super fast but there may not be another way without knowing which URL each sentence originally came from. There should be an extra sidecar file like Sentence Collector’s JSON file that lists the URL where each sentence came from.
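The sidecar could be as simple as a sentence-to-URL map. This is just a hypothetical format (and file name), not anything the scraper currently writes:

import json

# Hypothetical sidecar format: one entry per extracted sentence,
# mapping it to the article it came from.
sources = {
    "Example sentence extracted from an article.": "https://en.wikipedia.org/wiki/Example",
}

with open("sentence_sources.json", "w", encoding="utf-8") as f:
    json.dump(sources, f, ensure_ascii=False, indent=2)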

No, it’s random, so there is no way to do that.

Yep, that would take quite a long time. There might be some optimizations that could be done, but I’m not sure how legitimate they would be (for example, assuming that the chance of a sentence appearing in two articles is small).

Oh dear, that will indeed be rather slow if no record of the URL was kept for each sentence. Rather than trying to locate the URL for every single sentence in the Wikipedia extraction, it might be quicker to do that only for the sentences that already have validated recordings, and simply re-generate the rest. So . . .

  • Throw away all the WP sentences apart from those that have validated recordings
  • Re-extract a completely new collection, based on 3 sentences per article. Use @dabinat’s filtering as part of the scraper, and get a new sentence immediately if any one of the three fails.
  • Search the current article to see if it contains any of the existing validated sentences. If so, drop one of the new ones to ensure no more than three per article.
  • Record the URL of each sentence, and of any existing validated sentence that’s found (see the sketch below).
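
Roughly, the per-article step could look like this. It’s only a sketch: passes_filter stands in for whatever @dabinat’s filtering provides, and the sentence extraction itself is assumed to happen elsewhere in the scraper:

import random

def reextract_article(url, article_sentences, validated, passes_filter, per_article=3):
    # Keep previously validated sentences found in this article, then
    # top up with new, filtered sentences until we have per_article
    # sentences from this URL.
    kept = [s for s in article_sentences if s in validated]
    needed = max(0, per_article - len(kept))

    fresh = [s for s in article_sentences
             if s not in validated and passes_filter(s)]
    random.shuffle(fresh)
    new = fresh[:needed]

    # Record the source URL of every sentence taken from this article.
    sources = {s: url for s in kept + new}
    return new, sources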

If you use Python, you could use the num2words module to replace any figures with spelled-out words.
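
For example (just a sketch; num2words takes a lang parameter, e.g. "es" for Spanish):

import re
from num2words import num2words

def spell_out_numbers(sentence, lang="es"):
    # Replace every run of digits with its spelled-out form,
    # e.g. "Tiene 42 gatos" -> "Tiene cuarenta y dos gatos".
    return re.sub(r"\d+",
                  lambda m: num2words(int(m.group()), lang=lang),
                  sentence)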

@carlfm01 Do you have time this week to extract a list of less-used words (<5 repetitions) for the German and Italian Wikipedias too? (Or point us to how you did it for Spanish.)

@mkohler and @Mte90 would appreciate it, and I’m asking for dev time to implement a blacklist file directly in the extractor.

My time is limited this week. To generate the list with the spaCy vocabulary column, you need to use my fork of dabinat’s tool: https://github.com/carlfm01/cvtools/tree/spacy

To use it for other languages, you need to install the target language model (please use the md versions): https://spacy.io/usage/models

Then you need to change the spaCy model you want to use: https://github.com/carlfm01/cvtools/blob/9533a318cd63cd7967fa18dab8ac215fdc9c7da9/word_usage.py#L104

Finally, the generated file contains three columns: word, frequency, outOfSpacyVocab. Reading and filtering from this file is up to you at the moment.
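
Something like this would be a starting point, assuming whitespace-separated columns and adjusting the file name to the actual output:

MAX_FREQUENCY = 5

with open("word_usage.txt", encoding="utf-8") as infile, \
     open("blacklist.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip headers or malformed lines
        word, frequency, out_of_vocab = parts[0], int(parts[1]), parts[2]
        if frequency < MAX_FREQUENCY and out_of_vocab.lower() == "true":
            outfile.write(word + "\n")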

Interesting fact: I found a word-frequency file for the whole English Wikipedia here.

From a superficial analysis, any word under 900 repetitions is complex, weird or non-native. @gregor and I are currently trying to get this word usage for all sentences in the German and Spanish Wikipedias, and the number of repetitions to use will probably vary.

@nukeador How does the new Report button affect cleaning of the English wiki sentences? Should they still be cleaned through scripts or are we now relying on users to flag them?

I think both options are valid. The more people reporting, the better. I think the plan is to run regular reports on the flagged sentences, and maybe we can automate their removal somehow.

/cc @mbranson @gregor

I’m running into more complexity when analyzing word frequency.

There are some words with a high number of repetitions just because they are repeated hundreds of times in one or two articles, and we don’t really have a way to track that.

I’m doing some tests for Spanish: excluding words with 80 repetitions or fewer, I’m getting a very, very low number of sentences, and some of them are still invalid.
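
One way to track the per-article problem would be to count document frequency (in how many distinct articles a word appears) alongside the raw count. A sketch, assuming the article texts are available as (id, text) pairs:

from collections import Counter, defaultdict

def word_stats(articles):
    # articles: iterable of (article_id, text) pairs.
    # Returns total occurrence counts and the number of distinct
    # articles each word appears in, so a word that is frequent only
    # because it repeats inside one or two articles can be told apart
    # from a genuinely common word.
    totals = Counter()
    article_ids = defaultdict(set)
    for article_id, text in articles:
        words = text.lower().split()
        totals.update(words)
        for word in set(words):
            article_ids[word].add(article_id)
    doc_freq = {word: len(ids) for word, ids in article_ids.items()}
    return totals, doc_freq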