The emoji ones should probably be removed, but I have absolutely no idea about the rest. What I’d like to avoid is deleting a lot of valid sentences along with them. What we could do here is review a sample of these 115k sentences at a 95% confidence level with a 2% margin of error. That would be roughly 2300 sentences. It’s the same process we use when importing bigger data sources.
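For reference, here is a minimal sketch of where a number like that comes from. It assumes the standard sample-size formula for a proportion (Cochran's formula with a finite population correction) and the most conservative proportion of 0.5; the function name and defaults are illustrative, not the exact procedure used for imports.

```python
import math


def sample_size(population: int, z: float = 1.96, margin: float = 0.02, p: float = 0.5) -> int:
    """Sentences to review for a given confidence level and margin of error.

    z=1.96 corresponds to a 95% confidence level; p=0.5 is the worst-case
    assumption about the share of correct sentences.
    """
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction for the actual number of sentences.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(115_000))
```

With 115,000 sentences this lands at about 2,350, which is in line with the rough figure of 2300 above.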
Hello Michael!
I have finished reviewing this file, and 87% of the sentences are correct. Is this quality too low? What should we do now: remove the bad sentences, or remove all texts from this source?
Thanks for doing this! In my opinion, 87% correct sentences is way too high to remove the full source. We’d be removing a lot of correct sentences.
Instead, let’s think about improvements. I think removing the sentences with emojis, and in the future not allowing sentences with emojis to be uploaded, would be good for all languages.
Apart from that, did you notice any patterns in the wrong sentences that we might be able to reject automatically? I’d be happy to help out, but since I don’t know the language at all, I’d need help defining the rules.
U borishga biroz tortindi.
Hozir partiya tashkilotchimiz
Bo‘ron to‘xtagan bo‘lsa edi
Examples of wrong sentences:
hammasi [only one word, starts with lowercase and has no meaning as a sentence]
U borishga biroz tortindi [contains emojis]
U borishga tortindi бироз [contains Russian (Cyrillic) characters]
У боришга бироз тортинди [contains Russian (Cyrillic) characters]
One more thing I’ve just stumbled upon: there are Uzbek sentences that contain one of these characters: ў, ş, қ, ғ [none of these exist in the official Uzbek alphabet today. I suppose they appeared as a result of converting old Uzbek Cyrillic texts with the wrong software]
For example, this sentence: Lekin byudjet Erga kўzing kўk ʙўlmagani Gozel seni tўxtatişadi
So any sentence that contains at least one Russian character or ў/ş/қ/ғ should be deleted. Please help to resolve these cases.
Thank you!
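The rule above (reject any sentence with a Russian character or one of ў/ş/қ/ғ) could be sketched roughly as follows. This is only an illustration in Python, not the validation code the site actually uses. It assumes that "Russian character" can be approximated by the Unicode Cyrillic block U+0400 to U+04FF, which also happens to cover ў, қ and ғ, and it adds the uppercase Ş as a guess. Emoji and one-word sentences are not handled here.

```python
import re

# Assumption: any character in the Cyrillic block (U+0400-U+04FF) counts as
# a "Russian character". This range also includes ў, қ and ғ.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

# Latin letters named in the thread that are not in today's Uzbek alphabet.
# The uppercase form is added as an assumption.
EXTRA_INVALID = {"ş", "Ş"}


def is_rejected(sentence: str) -> bool:
    """Return True if the sentence should be removed under the proposed rule."""
    if CYRILLIC.search(sentence):
        return True
    return any(ch in EXTRA_INVALID for ch in sentence)


if __name__ == "__main__":
    samples = [
        "U borishga biroz tortindi.",
        "U borishga tortindi бироз",
        "У боришга бироз тортинди",
        "Lekin byudjet Erga kўzing kўk ʙўlmagani Gozel seni tўxtatişadi",
    ]
    for s in samples:
        print(is_rejected(s), s)
```

The first sample is kept and the other three are rejected. The apostrophe-like character in words such as Bo‘ron is outside the rejected ranges, so valid sentences using it pass through.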