Sentence Collector need help to remove

Thanks for reporting this. I took a random example from your list, and checked its source.

Here’s the full list of sentences that were submitted with the source "Telegram public chats": https://commonvoice.mozilla.org/sentence-collector/sentences/uz?source=Telegram%20public%20chats . That is about 115k sentences, so I’m a bit hesitant to just remove all of them.

While the emoji ones probably should indeed be removed, I have absolutely no idea about the rest, and I’d like to avoid deleting a lot of valid sentences along with them. What we could do is review a random sample of these 115k sentences, sized for a 95% confidence level and a 2% margin of error. That would be roughly 2300 sentences. This is the same process we use when importing bigger data sources.
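
For anyone wondering where the roughly 2300 comes from, here is a minimal sketch of the calculation. It assumes Cochran's sample size formula with a finite population correction and the most conservative proportion (p = 0.5); I have not checked which exact formula the import process uses, and the function name is just for illustration.

```python
import math

def sample_size(population, z=1.96, margin=0.02, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    z=1.96 corresponds to a 95% confidence level, margin is the margin
    of error, and p=0.5 is the worst-case (largest sample) assumption.
    """
    # Cochran's formula for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction for the 115k sentences
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(115_000))
```

With these assumptions the result is a bit above 2300, which matches the rough number above. A slightly different rounding or formula would shift it by a few dozen sentences at most.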

Here’s the validation list: https://docs.google.com/spreadsheets/d/1g1xhw0MooOnRMULb4bRHbEdR2ZaN7XuzyLYMsgthyLM/edit#gid=0 (I don’t want to give everyone edit access, so please request it and I’ll grant it to you).

Does that approach sound good to you?

Thanks
Michael
