Cleaning up sentence corpus

I’ve done some of the work for English here: https://github.com/dabinat/cvtools/blob/master/sentence_validator.py

I guess you could take a corpus with a low volume of foreign words, build up a list of 2, 3 and 4 letter sequences and then put ones that don’t appear (or appear at a low frequency) on a blacklist.
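Just to illustrate that idea in Python (this is only a sketch; the corpus file name and the frequency threshold are placeholders, not anything that exists yet):

from collections import Counter

def letter_ngrams(word, n):
    """All character sequences of length n inside a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Count 2-, 3- and 4-letter sequences across a corpus with few foreign words
counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        for word in line.lower().split():
            for n in (2, 3, 4):
                counts.update(letter_ngrams(word, n))

# Sequences that appear often enough are considered native
allowed = {seq for seq, count in counts.items() if count >= 5}  # arbitrary threshold

def looks_foreign(word):
    """True if the word contains a sequence that is rare or absent in the corpus."""
    word = word.lower()
    return any(seq not in allowed
               for n in (2, 3, 4)
               for seq in letter_ngrams(word, n))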

I have a script to calculate word usage here:

If you let me know what the total number of words should be (or the minimum number of occurrences), I could build up a list for English pretty quickly. The script should work fine with other languages too.
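(For anyone who doesn’t want to read the repo: the core of such a script is just a word-frequency count, roughly like the sketch below. This is not the actual word_usage.py, just an illustration of the idea.)

import re
import sys
from collections import Counter

# Minimal word-usage count: one sentence per input line, print "word count" pairs
counts = Counter()
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        counts.update(re.findall(r"\w+", line.lower()))

for word, count in counts.most_common():
    print(word, count)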

I’m testing this with the 1.48M Spanish sentences I have, and it seems words with accents (like “después”) are displayed without the accent (“despus”).

Also, words inside single quotes are identified as unique words:

‘tirano prófugo’
‘trastear’

Here I get:

‘tirano 1
prófugo’ 1
‘trastear’ 1

Moving to a new topic to avoid noise in the other conversation, since this is more practical.

I’ve done

python3 word_usage.py -i wiki.es.txt >> usage.es.txt
awk '$2 ~ /^1$/' usage.es.txt >> uniques.es.txt

According to this there are 212388 unique words; most of them are weird or non-native terms. Fixing the single-quote issue would reduce this number. I think it would be super safe to remove all sentences with these words (even the ones with 2, 3, 4 or even 5 repetitions are complex or weird, from the samples I’ve seen).
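For what it’s worth, the removal step itself would be simple. Assuming a blacklist with one rare word at the start of each line (like the uniques.es.txt above), something along these lines should do it (the input names are the ones from my commands, the output name is made up):

import re

# Words that occur only once (first column of uniques.es.txt)
with open("uniques.es.txt", encoding="utf-8") as f:
    blacklist = {line.split()[0].lower() for line in f if line.strip()}

# Keep only sentences that contain no blacklisted word
with open("wiki.es.txt", encoding="utf-8") as src, \
     open("wiki.es.filtered.txt", "w", encoding="utf-8") as dst:
    for sentence in src:
        words = {w.lower() for w in re.findall(r"\w+", sentence)}
        if not words & blacklist:
            dst.write(sentence)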

Additional math:

Repetitions    No. of words    Sentences affected
1              212388          212388
2              55068           110136
3              26208           78624
4              15560           62240
5              10523           52615
Total          319747          516003

Removing all these sentences would leave us with 970574 out of the total 1486577 extracted. Bearing in mind that on average each one takes 5 s to record, this would give us 1348 hours (versus 2064 hours by using all of them).
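The hour figures are just the sentence counts times 5 seconds:

# Sanity check of the figures above
total, affected = 1486577, 516003
remaining = total - affected       # 970574
print(remaining * 5 / 3600)        # ~1348 hours
print(total * 5 / 3600)            # ~2064 hours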

Thanks, I updated it and it should work better now (make sure to run with Python 3, not 2). I also added a --limit option to limit it to the top X words.

One thing worth noting is that some sentences have inconsistent symbols: some have “curly” quotes while others have “straight” quotes; some have em dashes, others have en dashes; etc. So you might get results for “you’ll” and also “youll” where the curly quote gets stripped out. I personally consider this an issue with validation upon import and not a bug with the word_usage script. I filed it as a Sentence Collector bug a while ago, but I guess it really applies to all imports, not only those through SC.

I just updated the script to convert curly apostrophes to straight and there’s also now a --min-frequency option to only show words with at least x matches.
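(In case anyone wants to replicate that preprocessing outside the script, it boils down to a one-character substitution; this is just an illustration, not the script’s own code:)

# Normalize curly apostrophes (U+2019) to straight ones (U+0027)
text = "you’ll"
print(text.replace("\u2019", "'"))  # you'll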

Oh wow, with the new version I’m getting

433102 words with no repetitions.
112300 words with 2 repetitions.

That’s a lot, and it would mean removing at least 657702 sentences. We should probably think about whether there is a better way to do this; it seems we are catching flies with a shotgun.

Probably searching for letter combinations that can’t go together in my language, plus maybe searching for words longer than X characters, would be a better way (at least for Spanish).

Yes, I think that’s the best way. Low-frequency long words are probably names of places or people, especially if they begin with a capital letter.

Added some extra arguments: --max-frequency and --show-words-only. So you can now easily create a blacklist of the least popular words.

Cool. Which script does the letter filtering? I would like to compare results.

BTW, these are the full exports from Wikipedia in Spanish and German, in case you want to play with them.

It’s all in the same repo (sentence_validator.py). That one’s very English-specific though so I’d recommend forking it for other languages.

Although not a problem with English Wikipedia, other sources quite often use strange characters to represent quotation marks, presumably to make the text look pretty. In case it’s of any use, here’s what I do to clean up:

import re

# Clean up the base text, and simplify some of the weird quote marks
atext = re.sub(r'\s+', ' ', atext).strip()  # collapse runs of spaces, linefeeds & tabs into single spaces
atext = re.sub(r'[<>+*#@^/]', '', atext)  # remove other non-allowed symbols
atext = re.sub(u201b, u2018, atext)  # single high-reversed-9 quote -> left single quote
atext = re.sub(u201f, u201d, atext)  # double high-reversed-9 quote -> right double quote
atext = re.sub(uff02, u0022, atext)  # fullwidth quotation mark -> straight double quote
atext = re.sub(u301d, u201c, atext)  # reversed double prime quote -> left double quote
atext = re.sub(u301e, u201d, atext)  # double prime quote -> right double quote
atext = re.sub("n’t", "n't", atext)  # clean up e.g. "don’t" where the 'apostrophe' is actually a Right Single Quotation Mark

where

# Main symbols
u0027 = '\u0027'  # ' APOSTROPHE [upright; can be used as single quote]
u0022 = '\u0022'  # " QUOTATION MARK [upright]
u2018 = '\u2018'  # ‘ LEFT SINGLE QUOTATION MARK
u2019 = '\u2019'  # ’ RIGHT SINGLE QUOTATION MARK  [sometimes used as an apostrophe]
u201c = '\u201c'  # “ LEFT DOUBLE QUOTATION MARK
u201d = '\u201d'  # ” RIGHT DOUBLE QUOTATION MARK

# Substituted before use
u201b = '\u201b'  # ‛ SINGLE HIGH-REVERSED-9 QUOTATION MARK
u201f = '\u201f'  # ‟ DOUBLE HIGH-REVERSED-9 QUOTATION MARK
uff02 = '\uff02'  # ＂ FULLWIDTH QUOTATION MARK
u301d = '\u301d'  # 〝 REVERSED DOUBLE PRIME QUOTATION MARK
u301e = '\u301e'  # 〞 DOUBLE PRIME QUOTATION MARK

One issue that’s not easy to solve is the use of the Right Single Quotation Mark as an apostrophe, and vice versa. A single sentence may, for example, contain two Right Single Quotation Marks, and it would take some work to sort out whether those are unmatched quotation marks (invalid) or apostrophes (valid).
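A rough heuristic, only a sketch and certainly not bulletproof, is that an apostrophe normally sits between two letters while a closing quote normally does not:

import re

RSQM = "\u2019"  # ’ RIGHT SINGLE QUOTATION MARK

def count_unmatched_quotes(sentence):
    """Count U+2019 characters that do not look like apostrophes.

    Heuristic only: an apostrophe usually sits between two letters
    (e.g. "don’t"), while a closing quote usually does not.
    """
    apostrophes = len(re.findall(r"\w" + RSQM + r"\w", sentence))
    return sentence.count(RSQM) - apostrophes

print(count_unmatched_quotes("I don’t know."))        # 0
print(count_unmatched_quotes("He said ‘don’t go’."))  # 1 (the closing quote)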

@carlfm01 check this out for the work we are currently doing for Spanish 🙂

I would like to mention that we have been testing spacy on the Spanish sentences, to get a smarter vocabulary list that reduces our sample much less (226K instead of 550K for words with fewer than 5 repetitions):

I don’t know if this is useful for all languages or if it’s worth integrating directly into the scraper script.
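For anyone curious what “using spacy” can look like in practice, here is a minimal sketch of the idea (counting lemmas rather than raw surface forms, so inflections of common words aren’t mistaken for rare vocabulary). It is only an illustration, not the actual pipeline used for Spanish, and it assumes the es_core_news_sm model is installed:

# Requires: pip install spacy && python -m spacy download es_core_news_sm
from collections import Counter
import spacy

nlp = spacy.load("es_core_news_sm")
counts = Counter()

with open("wiki.es.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(tok.lemma_.lower() for tok in nlp(line) if tok.is_alpha)

rare = [lemma for lemma, count in counts.items() if count < 5]
print(len(rare), "lemmas with fewer than 5 occurrences")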

Interesting… I will look into that.

To be honest, I feel that all of these methods should eventually be part of the wiki scraper. The problem with doing it afterwards is that all filtered sentences are removed entirely, whereas the scraper can easily replace a bad sentence with a different one so the quantity ends up the same. The wiki scraper does not have to worry about over-filtering the sentences.

Do you think there will be an opportunity in future to re-run the wiki scraper for English to make up for the ones that were filtered out? The challenge would be making sure we’re not adding more than three sentences from any page, but I guess you could do a pass across the page first and count how many sentences are matched from the previous wiki dump.

That’s unknown to me, since I don’t know if we have the ability to extract the exact same valid sentences as we did before. @gregor, @fiji and @mkohler know the code better.

Well, it’s supposed to be random, so I don’t know if that will be possible. You’d have to build up a list of all sentences on the page, then count how many match in the previous wiki dump. If it’s three or more, move on to another page; otherwise extract whatever will take the total sentences up to three.
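In rough Python terms, the per-page check would be something like this (the file name and helper are hypothetical; the previous dump is just treated as a set of sentence strings):

# Hypothetical per-page check: never exceed 3 sentences per article in total
with open("previous_extraction.txt", encoding="utf-8") as f:
    previous_dump = {line.strip() for line in f}

def sentences_to_extract(page_sentences, max_per_page=3):
    """Return new sentences so that old + new never exceeds max_per_page."""
    already_used = sum(1 for s in page_sentences if s in previous_dump)
    remaining = max_per_page - already_used
    if remaining <= 0:
        return []  # three or more already taken: move on to another page
    return [s for s in page_sentences if s not in previous_dump][:remaining]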

It’s probably not going to be super fast but there may not be another way without knowing which URL each sentence originally came from. There should be an extra sidecar file like Sentence Collector’s JSON file that lists the URL where each sentence came from.

No, it’s random, so there is no way to do that.

Yep, that would take quite a long time. There might be some optimizations that could be done, but not sure how legal that is (for example saying that the chance of a sentence being in 2 articles is small).

Oh dear, that will indeed be rather slow if no record of the URL was kept for each sentence. Rather than trying to locate the URL for every single sentence in the Wikipedia extraction, it might be quicker to do that only for the sentences that already have validated recordings, and simply re-generate the rest. So . . .

  • Throw away all the WP sentences apart from those that have validated recordings
  • Re-extract a completely new collection, based on 3 sentences per article. Use @dabinat’s filtering as part of the scraper, and get a new sentence immediately if any one of the three fails.
  • Search the current article to see if it contains any of the existing validated sentences. If so, drop one of the new ones to ensure no more than three per article.
  • Record the URL of each sentence, and of any existing validated sentence that’s found.

If you use Python, you could use the num2words module to replace any figures with spelled-out words.
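For example (this is just a sketch; num2words supports Spanish, German and many other locales via its lang argument):

import re
from num2words import num2words

def spell_out_numbers(sentence, lang="es"):
    """Replace each digit sequence with its spelled-out form."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), sentence)

print(spell_out_numbers("Tiene 25 años"))  # "Tiene veinticinco años"

If I remember correctly, num2words also has ordinal and year modes via its to= argument, which may matter for dates.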

@carlfm01 Do you have time this week to extract a list of less-used words (<5 occurrences) for the German and Italian Wikipedias too? (Or point us to how you did it for Spanish.)

@mkohler and @Mte90 would appreciate it, and I’m asking for dev time to implement a blacklist file directly in the extractor.