Although this isn’t a problem with English Wikipedia, other sources quite often use unusual characters to represent quotation marks, presumably to make the text look pretty. In case it’s of any use, here’s what I do to clean them up:
# Clean up the base text, and simplify some of the weird quote marks
import re
atext = re.sub(r'\s+', ' ', atext).strip() # collapse runs of whitespace (spaces, linefeeds, tabs) into a single space
atext = re.sub('[<>+*#@^/]', '', atext) # remove other non-allowed symbols
atext = re.sub(u201b, u2018, atext)
atext = re.sub(u201f, u201d, atext)
atext = re.sub(uff02, u0022, atext)
atext = re.sub(u301d, u201c, atext)
atext = re.sub(u301e, u201d, atext)
atext = re.sub("n’t", "n't", atext) # Clean up e.g. "don’t", where the 'apostrophe' is actually a Right Single Quotation Mark
where
# Main symbols
u0027 = '\u0027' # ' APOSTROPHE [upright; can be used as single quote]
u0022 = '\u0022' # " QUOTATION MARK [upright]
u2018 = '\u2018' # ‘ LEFT SINGLE QUOTATION MARK
u2019 = '\u2019' # ’ RIGHT SINGLE QUOTATION MARK [sometimes used as an apostrophe]
u201c = '\u201c' # “ LEFT DOUBLE QUOTATION MARK
u201d = '\u201d' # ” RIGHT DOUBLE QUOTATION MARK
# Substituted before use
u201b = '\u201b' # ‛ SINGLE HIGH-REVERSED-9 QUOTATION MARK
u201f = '\u201f' # ‟ DOUBLE HIGH-REVERSED-9 QUOTATION MARK
uff02 = '\uff02' # ＂ FULLWIDTH QUOTATION MARK
u301d = '\u301d' # 〝 REVERSED DOUBLE PRIME QUOTATION MARK
u301e = '\u301e' # 〞 DOUBLE PRIME QUOTATION MARK
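For anyone who wants to paste this in as one piece, here is a rough self-contained sketch of the same clean-up. The function name clean_quotes and the table name QUOTE_MAP are my own inventions for illustration, and I've swapped the chain of re.sub calls for str.translate, which should give the same result for these single-character substitutions.

```python
import re

# Hypothetical name: maps each decorative quote to the plainer mark
# used in the substitutions above.
QUOTE_MAP = str.maketrans({
    '\u201b': '\u2018',  # SINGLE HIGH-REVERSED-9 -> LEFT SINGLE QUOTATION MARK
    '\u201f': '\u201d',  # DOUBLE HIGH-REVERSED-9 -> RIGHT DOUBLE QUOTATION MARK
    '\uff02': '\u0022',  # FULLWIDTH QUOTATION MARK -> QUOTATION MARK
    '\u301d': '\u201c',  # REVERSED DOUBLE PRIME -> LEFT DOUBLE QUOTATION MARK
    '\u301e': '\u201d',  # DOUBLE PRIME -> RIGHT DOUBLE QUOTATION MARK
})

def clean_quotes(atext):
    """Hypothetical wrapper around the clean-up steps listed above."""
    atext = re.sub(r'\s+', ' ', atext).strip()  # collapse whitespace runs
    atext = re.sub(r'[<>+*#@^/]', '', atext)    # remove non-allowed symbols
    atext = atext.translate(QUOTE_MAP)          # simplify the odd quote marks
    return atext.replace('n\u2019t', "n't")     # don’t -> don't
```

Calling clean_quotes(raw_text) on each passage before further processing should be enough; the translate table is built once and reused.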
One issue that’s not easy to solve is the use of a Right Single Quotation Mark in place of an Apostrophe, and vice versa. A single sentence may, for example, contain two Right Single Quotation Marks, and it would take some work to sort out whether those are unmatched quotation marks (invalid) or apostrophes (valid).
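As a very rough first pass at that problem, one could at least separate the easy cases from the hard ones. The sketch below is only a heuristic under my own assumptions: a Right Single Quotation Mark with a letter on both sides (as in don’t or it’s) is counted as a likely apostrophe, and everything else is left undecided. The function name is made up for the example, and it does not attempt to resolve cases such as plural possessives (dogs’) or a closing quote at the end of a word.

```python
import re

RSQ = '\u2019'  # RIGHT SINGLE QUOTATION MARK

# A mark with a letter immediately before and after it is probably an apostrophe.
# [^\W\d_] matches any Unicode letter.
INNER = re.compile(r'(?<=[^\W\d_])\u2019(?=[^\W\d_])')

def split_right_single_marks(sentence):
    """Hypothetical helper: return (likely_apostrophes, undecided) counts."""
    total = sentence.count(RSQ)
    inner = len(INNER.findall(sentence))
    return inner, total - inner
```

A sentence that comes back with a non-zero undecided count would still need checking against the Left Single Quotation Marks it contains, or by hand.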