Some questions on cv-sentence-extractor

Hey @mkohler, I’m trying cv-sentence-extractor for Turkish. I have some more-or-less technical questions.

Q1. As far as I can see, when set, the broken_whitespace rule filters out sentences containing substrings like " !". Shouldn’t it fix them instead?

Q2. I could not find any construct (e.g. a regex) to replace multiple spaces. Am I missing it (Rust is new to me)?

Q3. I’m having trouble with apostrophes. I used cvtools, as instructed, to get the word frequencies, with the --strip-apostrophes setting. In Turkish, suffixes on proper names are separated from the name by an apostrophe, like [Bülent’in], [Bülent’e], [Bülent’lerden] etc. (Although there is a “clean” function which replaces possible Unicode alternatives, some pass through…)

This part is most important as I use dictionaries and stemmers to white-list words to get a black-list (implemented in an external repo).

How can I handle similar cases in cv-sentence-extractor (with regard to apostrophes)?

Thanks for creating this thread.

Then you would have to specify how each case should be fixed, and at that point there is not much difference from using replacements, which allow for far more flexibility.

Yes, that currently does not exist. Especially for Wikipedia I doubt that that is necessary. Do you encounter a lot of that? I can see how a double space can slip through, but more than that should be caught by editorial processes? For the double space using a replacement might be a good option instead.

I’m not sure if @dabinat is still active?

Not sure I understand this question, can you elaborate?

Thank you for the answers @mkohler

Yes, that is one possibility. I saw that many language rules use it, and I thought it was for fixing until I saw the code. I don’t understand why (for use in Common Voice) a sentence like “Where are you ?” would cause a problem. Thus the question…

It is a common practice. At least the replace part could introduce such doubles…

I could not find the reason by looking at the code. I ran it in WSL2, so it is not a Windows Unicode problem either (I first suspected the İ => i and I => ı conversion problems, which are common).

Suppose I want to black-list “Bülent”, but there is no option to split on apostrophes in cv-sentence-extractor. So Bülent’in etc. will pass through…

Btw, I handled #1 & #2 with replacements and an empty broken_whitespace setting.
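
To illustrate what "fixing instead of filtering" amounts to, here is a minimal Python sketch of ordered literal replacements. It is not the extractor's Rust code and not its rule-file syntax; the pair list and the `apply_replacements` helper are hypothetical names made up for this example.

```python
# Hypothetical (search, replacement) pairs, applied in the order listed.
REPLACEMENTS = [
    (" !", "!"),
    (" ?", "?"),
    ("  ", " "),  # a double space becomes a single space
]


def apply_replacements(sentence, pairs=REPLACEMENTS):
    """Apply each literal replacement until the sentence stops changing."""
    for search, replacement in pairs:
        if search in replacement:
            # A single pass avoids looping forever on self-reproducing pairs.
            sentence = sentence.replace(search, replacement)
            continue
        # Repeating the pair also collapses runs of three or more spaces.
        while search in sentence:
            sentence = sentence.replace(search, replacement)
    return sentence


print(apply_replacements("Where  are you ?"))
```

Repeating each pair until nothing changes is a choice made for this sketch. A tool that applies every replacement only once would leave a double space behind whenever the input had three or more spaces in a row.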

Trying to find a solution for #3 now…

Oh, it was in the wrong order in the code:

Perhaps it should be:

  • replace apostrophes (convert curly ones)
  • remove them
  • clean (without convert curly ones)

Edit: Yeah, that fixed it…
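
The ordering issue in the list above can be sketched in Python roughly as follows. This is not the cvtools source. The variant list and the function names are hypothetical, and the sketch assumes that stripping turns an apostrophe into a space so that the suffix splits off from the name. The final "clean" step is left out.

```python
# Illustrative, non-exhaustive set of apostrophe look-alikes:
# right/left single quotation mark, modifier letter apostrophe,
# grave accent, acute accent.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02BC\u0060\u00B4"


def normalize_apostrophes(text):
    """Map curly and other look-alike apostrophes to the ASCII one."""
    return text.translate({ord(ch): "'" for ch in APOSTROPHE_VARIANTS})


def strip_apostrophes(text):
    """Turn ASCII apostrophes into spaces (an assumption of this sketch)."""
    return text.replace("'", " ")


def words_wrong_order(text):
    # Stripping first only sees ASCII apostrophes, so curly ones survive
    # and are normalized too late to be split on.
    return normalize_apostrophes(strip_apostrophes(text)).split()


def words_fixed_order(text):
    # Normalizing first makes every variant visible to the strip step.
    return strip_apostrophes(normalize_apostrophes(text)).split()


sample = "Bülent\u2019in evi"
print(words_wrong_order(sample))
print(words_fixed_order(sample))
```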

Ah, yes. That currently does not exist. If this were implemented, I’d suggest keeping it very generic so it would also work for other languages.

I think an optional flag, off by default, would do it, similar to cvtools. The split would only be temporary, for token checking.

But the same problem will exist, we need to replace alternative apostrophes temporarily to be able to split.
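
A rough Python sketch of such an opt-in check, assuming a plain set of disallowed words. The `has_disallowed_word` function and its `split_on_apostrophes` flag are hypothetical names for illustration, not options that exist in cv-sentence-extractor.

```python
import re

# Illustrative set of apostrophe-like characters, including the ASCII one.
APOSTROPHES = re.compile("['\u2019\u2018\u02BC]")


def has_disallowed_word(sentence, disallowed, split_on_apostrophes=False):
    """Return True if any token of the sentence is in the disallowed set."""
    text = sentence
    if split_on_apostrophes:
        # Only this temporary copy is altered; the sentence itself is untouched.
        text = APOSTROPHES.sub(" ", text)
    # Trim common punctuation from each whitespace-separated token.
    tokens = (token.strip(".,;:!?\"()") for token in text.split())
    return any(token in disallowed for token in tokens)


sentence = "Bülent\u2019in evi büyük."
print(has_disallowed_word(sentence, {"Bülent"}))
print(has_disallowed_word(sentence, {"Bülent"}, split_on_apostrophes=True))
```

With the flag off, the token stays "Bülent’in" and passes the check. With the flag on, "Bülent" is seen on its own and the sentence is rejected, while the text that would end up in the output is never modified.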

Only those who want this will enable this, as per the readme…

Here comes my 35th language :frowning:

Hey @mkohler, what is the strategy on parentheses?

They are not wanted by Common Voice, but I see many rule sets which allow them. (I do not.)

My question about “How much can we replace?” was related to this.

There are many cases where the original/Latin name, pronunciation etc. is given in parentheses, and many sentences get filtered out because of this. Those words are usually also on the blacklist.

I was thinking of eliminating them from source by a rule, if possible… E.g. from the English article on Istanbul:

The city was founded as Byzantium (Greek: Βυζάντιον, Byzantion ) in the 7th century BCE by Greek settlers from Megara.

This could be converted to:

The city was founded as Byzantium in the 7th century BCE by Greek settlers from Megara.

There are way too many such occurrences, and removing them would help with low-resource Wikipedia languages, where many articles only have a couple of sentences (mostly with parentheses).
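
The conversion in the Istanbul example can be sketched with a regular expression. This is a minimal Python illustration, not an existing extractor rule, and `drop_parentheticals` is a hypothetical name. It handles non-nested parentheses only.

```python
import re

# Matches one non-nested parenthetical plus any whitespace right before it.
PARENTHETICAL = re.compile(r"\s*\([^()]*\)")


def drop_parentheticals(sentence):
    """Remove every non-nested "(...)" group from the sentence."""
    return PARENTHETICAL.sub("", sentence)


original = (
    "The city was founded as Byzantium (Greek: Βυζάντιον, Byzantion ) "
    "in the 7th century BCE by Greek settlers from Megara."
)
print(drop_parentheticals(original))
```

With nested parentheses, a single pass removes only the innermost group, so such sentences would still be caught by a rule that disallows parentheses.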

There is no special strategy. And as long as validation is done in several different places, it will not be possible to take a common approach to this.

That being said, if it helps to have such a rule specifically for the sentence extractor, then why not. But that possibly also depends on the answer to the question you linked in your post.