[Technical feedback needed] Wikipedia extractor script beta

I ran the Dutch version overnight, with blacklist generation set to 70. I also added some extra rarely used characters to the disallowed symbols, to filter out foreign words.
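For anyone unfamiliar with the blacklist step, here is a rough sketch of the idea in Python. It assumes the setting is a word-frequency cutoff (words seen fewer than N times in the whole dump go on the blacklist). The function name `build_blacklist` and the sample sentences are made up for illustration; this is not the extractor's actual code.

```python
import re
from collections import Counter


def build_blacklist(sentences, min_count):
    """Return the words that occur fewer than min_count times in total.

    Assumption: the blacklist is a plain word-frequency cutoff.
    """
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on word characters; real tokenization may differ.
        counts.update(re.findall(r"\w+", sentence.lower()))
    return {word for word, count in counts.items() if count < min_count}


# Tiny illustrative sample; a real run would use the full Wikipedia dump
# and a cutoff such as 70 or 100.
sample = [
    "De kat zit op de mat.",
    "De hond zit op de bank.",
    "Een quokka zit op de mat.",
]
blacklist = build_blacklist(sample, min_count=2)
```

A higher cutoff blacklists more rare words, so fewer sentences survive but they should contain more common vocabulary.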
Finally ended up with 1.25 million sentences!
I ran a quick test and I think the majority of these are good sentences. Luckily there are not that many abbreviations (e.g. about 4,000 sentences with ‘ca.’), and with so many sentences we could decide to simply add them to the disallowed symbols in the rules.
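To get a feel for how many sentences an abbreviation rule would cost, something like the following Python sketch could be run over the output. The abbreviation list and the helper names are hypothetical, and this only approximates what a disallowed-symbols rule would do in the extractor itself.

```python
import re

# Hypothetical list of Dutch abbreviations; extend as needed.
ABBREVIATIONS = ["ca.", "bijv.", "o.a."]


def abbreviation_pattern(abbrevs):
    """Build a regex matching any abbreviation that does not continue a word."""
    escaped = "|".join(re.escape(a) for a in abbrevs)
    # The lookbehind avoids matching the end of a longer word, e.g. "Monica."
    return re.compile(r"(?<!\w)(?:" + escaped + r")", re.IGNORECASE)


def split_by_abbreviation(sentences, abbrevs):
    """Return (kept, dropped): sentences without and with an abbreviation."""
    pattern = abbreviation_pattern(abbrevs)
    kept, dropped = [], []
    for sentence in sentences:
        (dropped if pattern.search(sentence) else kept).append(sentence)
    return kept, dropped


sentences = [
    "Het dorp heeft ca. 400 inwoners.",
    "De kerk staat in het centrum.",
    "Hij speelde o.a. in Ajax.",
    "Monica. woont hier.",
]
kept, dropped = split_by_abbreviation(sentences, ABBREVIATIONS)
print(f"kept {len(kept)}, dropped {len(dropped)}")
```

With over a million sentences available, dropping a few thousand this way should be an acceptable trade-off.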
I think I am going to run it once more, with blacklist generation set to 100 and the known abbreviations added.
To be continued, but this looks very promising.
