Sentence Collector: need help removing low-quality sentences

Hello

Uzbek users reading the sentence corpus have noticed many sentences with a lot of mistakes. The source of those texts is Telegram chats.

I suspect that the same user has two or more accounts, which he or she uses to approve the sentences automatically.

Is there any way to remove all sentences that were uploaded by that user or that come from that source?

(upload://b3Mtm7BML6WwlNLbho2ixiY4o9B.jpeg)

Thanks for reporting this. I took a random example from your list, and checked its source.

Here’s the full list of sentences that were submitted with the source "Telegram public chats": https://commonvoice.mozilla.org/sentence-collector/sentences/uz?source=Telegram%20public%20chats . That is 115k sentences, so I’m a bit hesitant to just remove all of them.

While the emoji ones probably should indeed be removed, I have absolutely no idea about the rest. What I’d like to avoid is deleting a lot of valid sentences along with the bad ones. What we could do here is a sample review of part of these 115k sentences, sized for a 95% confidence level and a 2% margin of error. That would be roughly 2300 sentences. It is the same process we use when importing bigger data sources.
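For anyone who wants to check the sample size, here is a minimal sketch of one common way to get such a number: Cochran's formula with a finite population correction. This is an assumption about the calculation, not necessarily the exact one used for Common Voice imports. The function name `sample_size`, the z-score of 1.96 for 95% confidence, and the worst-case proportion `p = 0.5` are all illustrative choices.

```python
import math


def sample_size(population: int, z: float = 1.96, margin: float = 0.02, p: float = 0.5) -> int:
    """Sentences to review for a given confidence level (via z) and margin of error.

    Assumes Cochran's formula with a finite population correction;
    p = 0.5 is the most conservative guess for the share of bad sentences.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it slightly because the population is finite.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(115_000))
```

With 115k sentences this lands at about 2,350, which is in line with the "roughly 2300" above.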

Here’s the validation list: https://docs.google.com/spreadsheets/d/1g1xhw0MooOnRMULb4bRHbEdR2ZaN7XuzyLYMsgthyLM/edit#gid=0 (don’t want to give everyone edit access, so please request it and I can give it to you).

Does that approach sound good to you?

Thanks
Michael

Hello Michael!

I couldn’t open this link; I’m getting a 502 Bad Gateway server error: https://commonvoice.mozilla.org/sentence-collector/sentences/uz?source=Telegram%20public%20chats .

I requested edit access to the sheet of sentences to be reviewed. Can you please approve it?

Yeah, that query might time out. I shared edit access with you on the doc.

Hello Michael!
I have finished reviewing this file, and 87% of the sentences are correct. Is that quality too low? What should we do now: remove only the bad sentences, or remove all texts from this source?

Thanks for doing this! 87% correct sentences is in my opinion way too high to remove the full source. We’d be removing a lot of sentences that would be correct.
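To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the reviewed sample is representative of all 115k sentences and applies the 2% margin of error the sample was sized for; the function name `projected_valid` is made up for illustration.

```python
def projected_valid(population: int, correct_pct: int, margin_pct: int = 2) -> tuple[int, int, int]:
    """Project the sample's share of correct sentences onto the whole source.

    Returns (low, estimate, high) counts of valid sentences, assuming the
    sample is representative and the margin of error applies as planned.
    """
    low = population * (correct_pct - margin_pct) // 100
    estimate = population * correct_pct // 100
    high = population * (correct_pct + margin_pct) // 100
    return low, estimate, high


if __name__ == "__main__":
    low, estimate, high = projected_valid(115_000, 87)
    print(f"valid sentences: about {estimate} (between {low} and {high})")
```

Under those assumptions, removing the whole source would throw away on the order of 100k valid sentences.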

Instead, let’s think about improvements. I think removing the sentences with emojis, and in the future not allowing sentences with emojis to be uploaded, would be good for all languages.
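As a starting point for discussion, an emoji check could look something like the sketch below. This is not the Sentence Collector's actual validation code. The code point ranges are an assumption, cover only the most common emoji blocks, and would need review before being used for real. The names `EMOJI_RANGES`, `contains_emoji`, and `split_by_emoji` are hypothetical.

```python
# Assumed, non-exhaustive list of Unicode ranges that hold most emojis.
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, extended symbols
    (0x1F1E6, 0x1F1FF),  # regional indicator letters (used for flags)
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0xFE0F, 0xFE0F),    # variation selector that forces emoji presentation
]


def contains_emoji(sentence: str) -> bool:
    """Return True if any character falls inside one of the assumed emoji ranges."""
    return any(low <= ord(ch) <= high for ch in sentence for low, high in EMOJI_RANGES)


def split_by_emoji(sentences: list[str]) -> tuple[list[str], list[str]]:
    """Split sentences into (kept, rejected), rejecting those that contain emojis."""
    kept = [s for s in sentences if not contains_emoji(s)]
    rejected = [s for s in sentences if contains_emoji(s)]
    return kept, rejected


if __name__ == "__main__":
    kept, rejected = split_by_emoji(["Salom, dunyo!", "Salom 😀"])
    print(len(kept), "kept,", len(rejected), "rejected")
```

The same check could be used both to clean up the existing sentences and to reject new uploads. Letters used in Uzbek, such as the ʻ in "Oʻzbekiston" or Cyrillic characters, fall outside these ranges and are not affected.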

Apart from that, did you notice any patterns in the wrong sentences that we might be able to reject automatically? I’d be happy to help out, but not knowing the language at all I’d need help defining the rules.
