Until now I have stayed away from Wiki* and thus the Sentence Extractor, but I’m running out of resources (sigh)…
I scanned some random samples from the Turkish “Vikipedi” and found that many of them are off-topic for Common Voice: they contain many foreign names, chemical substance names, short entries that are just lists (e.g. a football player’s games), etc.
Here, I see many tools such as blacklists and/or vocabulary, but as far as I can see they would need a considerable investment of time and trial and error to produce good results.
We have around 500k entries on “Vikipedi”, which could yield 1.5 M sentences, but scanning them manually is nearly impossible with current manpower… Still, if quality sentences come out, that would solve half of our problems for years to come.
I want to hear from those who have used this process:
- Do you get good results if you invest the time?
- Which parts of the rules are most important?
- Can you replace a “bad sentence” with a better one and/or exclude that sentence?
- Any other advice?
PS: @mkohler directed me to ask the question here…