Sentence Collector: need help removing sentences

Hello

While reading the sentence corpus, Uzbek users have noticed many sentences with a lot of mistakes. The source of those texts is Telegram chats.

I suspect that the same user has two or more accounts, which he/she uses to approve the sentences automatically.

Is there any way to remove all sentences that were uploaded by that user or that come from that source?

(upload://b3Mtm7BML6WwlNLbho2ixiY4o9B.jpeg)

Thanks for reporting this. I took a random example from your list, and checked its source.

Here’s the full list of sentences that got submitted with the source "Telegram public chats ": https://commonvoice.mozilla.org/sentence-collector/sentences/uz?source=Telegram%20public%20chats . These are 115k sentences, so I’m a bit hesitant to just remove all of that.

While the ones with emojis should probably be removed, I have absolutely no idea about the rest. I'd also like to avoid deleting a lot of valid sentences. What we could do here is a sample review of part of these 115k sentences, with a 95% confidence level and a 2% margin of error. That would be roughly 2300 sentences. This is the same process we use when importing bigger data sources.
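
As a rough check on the "roughly 2300" figure, here is a minimal sketch of one common way to size such a sample: Cochran's formula with a finite population correction. The thread does not say which formula was actually used, so the formula, the function name `sample_size`, and the worst-case proportion `p = 0.5` are all assumptions.

```python
import math


def sample_size(population: int, z: float = 1.96, margin: float = 0.02, p: float = 0.5) -> int:
    """Estimate a review sample size (hypothetical helper, not from the thread).

    z      -- z-score for the confidence level (1.96 for 95%)
    margin -- desired margin of error (0.02 for 2%)
    p      -- assumed proportion of valid sentences; 0.5 is the worst case
    """
    # Cochran's formula for an infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction for the known corpus size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# 115k sentences, 95% confidence level, 2% margin of error.
print(sample_size(115_000))
```

Under these assumptions the result lands close to the number quoted above.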

Here’s the validation list: https://docs.google.com/spreadsheets/d/1g1xhw0MooOnRMULb4bRHbEdR2ZaN7XuzyLYMsgthyLM/edit#gid=0 (don’t want to give everyone edit access, so please request it and I can give it to you).

Does that approach sound good to you?

Thanks
Michael

Hello Michael!

I couldn’t open this link; I’m getting a 502 Bad Gateway server error: https://commonvoice.mozilla.org/sentence-collector/sentences/uz?source=Telegram%20public%20chats .

I requested edit access for this sheet of words to be reviewed. Can you please approve it?

Yeah, that query might time out. I shared edit access with you on the doc.

Hello Michael!
I have finished reviewing this file, and 87% of the sentences are correct. Is this quality a bit low? What shall we do now: remove only the bad sentences, or remove all texts from this source?

Thanks for doing this! 87% correct sentences is, in my opinion, way too high to remove the full source. We’d be removing a lot of correct sentences.
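
To put a number on that, here is a small back-of-the-envelope sketch. The 115k total and the 87% result come from this thread; the exact sample size of 2300 is an assumption (the thread only says "roughly 2300"), and the normal-approximation margin of error is an illustrative choice, not something the review process prescribes.

```python
import math

TOTAL_SENTENCES = 115_000  # from the thread
CORRECT_SHARE = 0.87       # review result reported in the thread
SAMPLE_SIZE = 2300         # assumption: the thread says "roughly 2300"
Z_95 = 1.96                # z-score for a 95% confidence level


def margin_of_error(p: float, n: int, z: float = Z_95) -> float:
    """Normal-approximation margin of error for a sampled proportion."""
    return z * math.sqrt(p * (1 - p) / n)


margin = margin_of_error(CORRECT_SHARE, SAMPLE_SIZE)
estimated_valid = round(TOTAL_SENTENCES * CORRECT_SHARE)

print(f"Estimated valid sentences in the source: {estimated_valid}")
print(f"Margin of error on the 87% estimate: +/- {margin:.1%}")
```

Under these assumptions, removing the whole source would discard on the order of 100k valid sentences.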

Instead, let’s think about improvements. I think removing the sentences with emojis, and not allowing sentences with emojis to be uploaded in the future, would be good for all languages.

Apart from that, did you notice any patterns in the wrong sentences that we might be able to reject automatically? I’d be happy to help out, but not knowing the language at all I’d need help defining the rules.

Alright. I guess we will need help from our Uzbek community to report cases as they occur.
I noticed only 2 patterns in all the wrong sentences:

  • contains Cyrillic (Russian) characters (a regex such as [а-яА-ЯёЁ] should match Russian letters; the range а-я alone does not include ё)
  • starts with a lowercase letter and contains only 1 word

Can we find and remove these kinds of sentences, and also those containing emojis?

Yes, I think we can do that. Could you give me one valid and one invalid example for each of those so I can better test it?

Examples of correct sentences:

U borishga biroz tortindi.
Hozir partiya tashkilotchimiz
Bo‘ron to‘xtagan bo‘lsa edi


Examples of wrong sentences:

hammasi [only one word, starts with lowercase and has no meaning as a sentence]
U borishga biroz tortindi 😁😭👍 [contains emojis]
U borishga tortindi бироз [contains Russian characters]
У боришга бироз тортинди [contains Russian characters]

Thank you! This has now been taken care of.
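
For readers who want to run similar checks on their own data, here is a minimal sketch of the three rules discussed above. It is not the Sentence Collector's actual implementation. The function name `rejection_reason` is hypothetical, and the emoji ranges are an approximation that covers common emoji blocks rather than every emoji.

```python
import re

# Russian letters, including ё/Ё, which fall outside the а-я / А-Я ranges.
RUSSIAN_LETTERS = re.compile(r"[а-яА-ЯёЁ]")

# Approximate emoji detection (assumption): main pictograph blocks plus
# the Miscellaneous Symbols and Dingbats blocks.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def rejection_reason(sentence):
    """Return why a sentence would be rejected, or None if it passes."""
    if EMOJI.search(sentence):
        return "contains emojis"
    if RUSSIAN_LETTERS.search(sentence):
        return "contains Russian characters"
    words = sentence.split()
    if len(words) == 1 and words[0][0].islower():
        return "single lowercase word"
    return None


examples = [
    "U borishga biroz tortindi.",
    "Hozir partiya tashkilotchimiz",
    "hammasi",
    "U borishga biroz tortindi 😁😭👍",
    "U borishga tortindi бироз",
    "У боришга бироз тортинди",
]
for sentence in examples:
    print(repr(sentence), "->", rejection_reason(sentence))
```

The examples are the valid and invalid sentences given earlier in this thread.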

One more thing I’ve just stumbled upon: Uzbek sentences that contain one of these characters:
ў, ş, қ, ғ [none of these exist in the official Uzbek alphabet today. I suppose they appeared as a result of converting old Uzbek Cyrillic texts with faulty software]

For example, this sentence:
Lekin byudjet Erga kўzing kўk ʙўlmagani Gozel seni tўxtatişadi

So any sentence that contains at least one Russian character or ў/ş/қ/ғ should be deleted. Please help to resolve these cases.
Thank you!

Taken care of. I also added it to the validation so that new sentences in the Sentence Collector can’t be uploaded if they contain those characters.
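
A character check of this kind could look roughly like the sketch below. It is an illustration only, not the validation code that was added to the Sentence Collector. The name `has_forbidden_characters` is hypothetical, and including the upper-case variants of ў/ş/қ/ғ is an assumption, since the thread lists only the lower-case forms.

```python
import re

# Russian letters plus the four characters named in the thread.
# Upper-case Ў, Ş, Қ, Ғ are an assumption, not something the thread states.
FORBIDDEN = re.compile(r"[а-яА-ЯёЁўЎşŞқҚғҒ]")


def has_forbidden_characters(sentence):
    """True if the sentence contains any character that should block upload."""
    return FORBIDDEN.search(sentence) is not None


print(has_forbidden_characters(
    "Lekin byudjet Erga kўzing kўk ʙўlmagani Gozel seni tўxtatişadi"
))
print(has_forbidden_characters("U borishga biroz tortindi."))
```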