These are great advances, @txopi.
You might be able to tweak the rules file and the blacklist to avoid Roman ordinals. Other people in this topic may be able to help with the regex.
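As a starting point for that regex, here is a rough Python sketch that flags tokens that look like Roman numerals. The function name `find_roman_numerals` is made up for illustration, and this is not the rules file syntax, so the pattern would still need adapting to whatever the rules file accepts. It first picks out all-uppercase tokens made only of Roman numeral letters, then keeps those that form a well-formed numeral.

```python
import re

# Tokens made up only of uppercase Roman numeral letters.
CANDIDATE = re.compile(r"\b[IVXLCDM]+\b")

# Well-formed Roman numerals up to 3999 (thousands, hundreds, tens, units).
WELL_FORMED = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def find_roman_numerals(sentence):
    """Return the tokens in `sentence` that are well-formed Roman numerals."""
    return [
        token
        for token in CANDIDATE.findall(sentence)
        if WELL_FORMED.fullmatch(token)
    ]


print(find_roman_numerals("Louis XIV ruled in the XVII century"))
```

Note that single letters such as "I" or "C" also count as numerals here, so in some languages you may want to require at least two letters or handle those cases in the blacklist instead.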
Once you have a set of rules and a blacklist whose output 2-3 native speakers rate at below a 7% error rate, feel free to open a PR that includes the following information:
- How many sentences are you getting?
- How did you create the blacklist? (Specify the criteria, e.g. words with fewer than 80 repetitions.)
- Get 2-3 additional native speakers (ideally some linguists) to comment here with their estimated error rate. You can share with them a few samples of 500 random sentences from your output.
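In case it helps with the blacklist criteria and the review samples, here is a minimal Python sketch of both steps. It assumes your extracted sentences are already loaded as a list of strings. The function names, the simple `\w+` tokenization and the lowercasing are my own assumptions, not something the extractor does for you.

```python
import random
import re
from collections import Counter

WORD = re.compile(r"\w+")


def build_blacklist(sentences, min_count=80):
    """Return the set of words that appear fewer than `min_count` times."""
    counts = Counter(
        word.lower() for sentence in sentences for word in WORD.findall(sentence)
    )
    return {word for word, count in counts.items() if count < min_count}


def sample_for_review(sentences, k=500, seed=None):
    """Return up to `k` sentences drawn at random without replacement."""
    pool = list(sentences)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


corpus = ["the cat sat", "the dog sat", "the bird flew"]
print(sorted(build_blacklist(corpus, min_count=2)))
print(len(sample_for_review(corpus, k=500, seed=0)))
```

Passing a fixed `seed` makes a sample reproducible, which is handy if several reviewers should look at the same 500 sentences. Use different seeds to produce different samples.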
Cheers.