Sentence Collector need help to remove

Ben_Khasanov · November 23, 2021, 5:53am

Hello

While reading the sentence corpus, Uzbek users have noticed many sentences with a lot of mistakes. The source of those texts is Telegram chats.

I suspect that the same user has two or more accounts with which he/she approves the sentences automatically.

Is there any way to remove all sentences that were uploaded by that user or that come from that source?

[Attached image: list of affected sentences]

mkohler · November 23, 2021, 6:30pm

Thanks for reporting this. I took a random example from your list, and checked its source.

Here’s the full list of sentences that got submitted with the source "Telegram public chats ": https://commonvoice.mozilla.org/sentence-collector/sentences/uz?source=Telegram%20public%20chats . These are 115k sentences, so I’m a bit hesitant to just remove all of that.

While the emoji ones probably should indeed be removed, I have absolutely no idea about the rest. What I’d like to avoid is also deleting a lot of valid sentences. What we could do here is a sample review of part of these 115k sentences with a 95% confidence level and a 2% margin of error. That would be roughly 2300 sentences. It is the same process that is used when importing bigger data sources.
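For reference, a figure in that range can be reproduced with the standard sample-size formula (Cochran's formula with a finite population correction). This is only a sketch: the post does not say which formula or tool was used, so the function name `sample_size` and the worst-case proportion `p = 0.5` are assumptions.

```python
import math


def sample_size(population: int, z: float = 1.96,
                margin: float = 0.02, p: float = 0.5) -> int:
    """Sample size for estimating a proportion.

    z      -- z-score for the confidence level (1.96 for 95%)
    margin -- margin of error (0.02 for 2%)
    p      -- assumed proportion; 0.5 is the most conservative choice
    """
    # Cochran's formula for an infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction for the 115k sentences.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(115_000))  # 2352
```

With these assumptions the result is a little above 2300, which matches the "roughly 2300" estimate.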

Here’s the validation list: https://docs.google.com/spreadsheets/d/1g1xhw0MooOnRMULb4bRHbEdR2ZaN7XuzyLYMsgthyLM/edit#gid=0 (don’t want to give everyone edit access, so please request it and I can give it to you).

Does that approach sound good to you?

Thanks
Michael

Abbosjon_Kudratov · November 24, 2021, 5:43am

Hello Michael!

I couldn’t open this link; I’m getting a 502 Bad Gateway server error: https://commonvoice.mozilla.org/sentence-collector/sentences/uz?source=Telegram%20public%20chats .

I requested edit access to the review sheet. Could you please approve it?

mkohler · November 24, 2021, 10:08am

Yeah, that query might time out. I shared edit access with you on the doc.

Abbosjon_Kudratov · November 29, 2021, 5:37am

Hello Michael!
I have finished reviewing this file, and 87% of the sentences are correct. Is that quality a bit low? What should we do now: remove only the bad sentences, or remove all texts from this source?

mkohler · November 30, 2021, 5:53pm

Thanks for doing this! 87% correct sentences is in my opinion way too high to remove the full source. We’d be removing a lot of sentences that would be correct.

Instead, let’s think about improvements. I think removing the sentences with emojis, and in the future not allowing sentences with emojis to be uploaded, would be good for all languages.

Apart from that, did you notice any patterns in the wrong sentences that we might be able to reject automatically? I’d be happy to help out, but not knowing the language at all I’d need help defining the rules.

Abbosjon_Kudratov · December 3, 2021, 2:01pm

Alright. I guess we will need help from our Uzbek community to report cases as they occur.
I noticed only 2 patterns in all the wrong sentences:

contains Cyrillic (Russian) characters (this regex should match Russian letters: [а-яА-Я])
starts with a lowercase letter and contains only 1 word

Can we find and remove these kinds of sentences, and also those containing emojis?

mkohler · December 3, 2021, 5:58pm

Yes, I think we can do that. Could you give me one valid and one invalid example for each of those so I can better test it?

Abbosjon_Kudratov · December 6, 2021, 6:53am

Examples of right sentences:

U borishga biroz tortindi.
Hozir partiya tashkilotchimiz
Bo‘ron to‘xtagan bo‘lsa edi

Examples of wrong sentences:

hammasi [only one word, starts with lowercase and has no meaning as a sentence]
U borishga biroz tortindi [contains emojis]
U borishga tortindi бироз [contains Russian characters]
У боришга бироз тортинди [contains Russian characters]
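The rules discussed in this thread, checked against the examples above, could be sketched roughly as follows. This is not the Sentence Collector's actual implementation: the function name `rejection_reason` is hypothetical, the emoji pattern is a rough approximation covering only some common Unicode blocks, and the Cyrillic pattern is the one suggested earlier in the thread (which, as written, does not match ё/Ё). The emoji in the usage example is a hypothetical stand-in.

```python
import re
from typing import Optional

# Russian letters, exactly as suggested in the thread (does not match ё/Ё).
CYRILLIC_RE = re.compile(r"[а-яА-Я]")
# Assumption: rough emoji approximation, common pictograph/symbol blocks only.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def rejection_reason(sentence: str) -> Optional[str]:
    """Return why a sentence would be rejected, or None if it passes."""
    text = sentence.strip()
    if not text:
        return "empty"
    if EMOJI_RE.search(text):
        return "emoji"
    if CYRILLIC_RE.search(text):
        return "cyrillic"
    # A single word that starts with a lowercase letter.
    if len(text.split()) == 1 and text[0].islower():
        return "single lowercase word"
    return None


for s in ["U borishga biroz tortindi.", "hammasi",
          "U borishga tortindi бироз", "U borishga biroz tortindi \U0001F600"]:
    print(repr(s), "->", rejection_reason(s))
```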

mkohler · December 7, 2021, 6:41pm

Thank you! This has now been taken care of.

Abbosjon_Kudratov · December 9, 2021, 5:40am

One more thing I’ve just stumbled upon: Uzbek sentences that contain one of these characters:
ў, ş, қ, ғ [none of these exist in the official Uzbek alphabet today. I suppose they appeared when old Uzbek Cyrillic texts were converted with faulty software]

For example, this sentence:
Lekin byudjet Erga kўzing kўk ʙўlmagani Gozel seni tўxtatişadi

So any sentence that contains at least one Russian character or ў/ş/қ/ғ should be deleted. Please help to resolve these cases.
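As a rough illustration of that rule applied to a batch, here is a minimal sketch that splits sentences into kept and removed lists. The function name `partition` is hypothetical, and the uppercase forms Ў, Ş, Қ, Ғ are added as an assumption, since only the lowercase characters are listed here.

```python
import re

# Russian letters plus ў, ş, қ, ғ as named in the post.
# Assumption: the uppercase forms Ў, Ş, Қ, Ғ should be rejected as well.
FORBIDDEN_RE = re.compile(r"[а-яА-ЯўЎşŞқҚғҒ]")


def partition(sentences):
    """Split sentences into (kept, removed) based on forbidden characters."""
    kept, removed = [], []
    for sentence in sentences:
        if FORBIDDEN_RE.search(sentence):
            removed.append(sentence)
        else:
            kept.append(sentence)
    return kept, removed


kept, removed = partition([
    "U borishga biroz tortindi.",
    "Lekin byudjet Erga kўzing kўk ʙўlmagani Gozel seni tўxtatişadi",
])
print(len(kept), len(removed))  # 1 1
```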
Thank you!

mkohler · December 9, 2021, 4:43pm

Taken care of. I also added it to the validation so that new sentences in the Sentence Collector can’t be uploaded if they contain those characters.
