Remove all sentences in sentence collector for Ukrainian

Somebody submitted a lot of sentences for Ukrainian, all of which are inappropriate.
I’ve reviewed them, but there is nothing to approve.
Is it possible to clear them all?

My second question: where can I find the already approved sentences so I can have a look at them?

Thank you!

First of all, you can find the approved and exported sentences here: https://github.com/mozilla/voice-web/blob/master/server/data/uk/sentence-collector.txt

Approximately how many sentences are we talking about? And can you give me your username?

For my future reference: the deletion query would need to delete all not-approved sentences, but not those that Artem has already approved!

Hi Michael,

Currently there are 9289 unreviewed sentences.
My username on the Sentence Collector service is artem and my GitHub username is a-polivanchuk.

Thanks for the link! I quickly went through the list and see many incorrect sentences there as well, which sound completely unnatural. I guess they should be removed too.
Can I clean it up and create the PR later?

Thanks. Do I understand correctly that you went through all 9000 sentences and there are none left that warrant an approval?

Can you give me some examples of these sentences and explain why they are not natural?

For the approved sentences a PR is not enough, as the next export will just export them again. We would also need to delete them from the Sentence Collector database. I have a script for that, if you could give me a text file with all the sentences to delete, one per line. Running an export after that will also remove them from the sentence-collector.txt file.
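As an illustration of the "one sentence per line" workflow, here is a minimal sketch of how such a file might be read and applied. It is not the actual script: the function names `load_sentences_to_delete` and `remove_sentences` are hypothetical, and the stored sentences are represented as a plain list rather than a database.

```python
from pathlib import Path


def load_sentences_to_delete(path):
    """Read one sentence per line, skipping blank lines and duplicates."""
    seen = set()
    result = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        sentence = line.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


def remove_sentences(stored, to_delete):
    """Return (kept, not_found).

    kept:      stored sentences that are not listed for deletion
    not_found: listed sentences that matched nothing in storage,
               which usually points at a typo or whitespace difference
    """
    targets = set(to_delete)
    kept = [s for s in stored if s not in targets]
    stored_set = set(stored)
    not_found = [s for s in to_delete if s not in stored_set]
    return kept, not_found
```

Matching here is exact after trimming whitespace, so the file should contain the sentences exactly as they appear in the export; reporting `not_found` makes silent mismatches visible.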

Yes, that’s what I mean. All the sentences are typically similar and related to some political discussion. There are even sentences in Russian. It looks like the list was just copied and pasted without any additional processing and reviewing.

Examples:

  1. Порошенко так не робив як Ви, Порошенко телефон у мене не забирав. (Mentions the ex-president’s surname and something about a mobile phone)
  2. Вот этот шаг Вы можно делать сейчас без Верховной Рады. (This sentence is absurd and it is in Russian)
  3. Ми говоримо про повагу, Олег Валерійович, давайте триматися поваги. (Mentions the first name and patronymic of some politician and is grammatically incorrect)
  4. Я просто говорю як пропозицію. (Grammatically incorrect and not natural)
  5. Перепрошую, одну секундочку, тому що дійсно друге читання. (Not natural and truncated context)

At first, I tried to catch and approve good sentences, but then realized it’s a waste of time.

Got it! Regarding already approved sentences, I’ll prepare the txt file and provide it to you when it’s ready.

Thanks.

@nukeador do you agree with going ahead and deleting all unreviewed sentences that have not yet been approved by Artem?

If all of them are confirmed to be inappropriate, yes. Just for the sake of process, can we get this confirmation from another Ukrainian speaker we know?

Thanks!

I do confirm that all these sentences are some political bullshit.

WBR,
Iurii Klekovkin.

Can we get some examples?

Are they bad sentences or just politically conflictive?

They just have bad grammar, or were said by a person with A2/B1-level language proficiency.

Порошенко так не робив як Ви, Порошенко телефон у мене не забирав.

↓ A native speaker would say something like this (I’m guessing, as the sentence’s meaning is not clear at all)

Порошенко так як Ви не робив, наприклад телефона у мене не забирав.

§

Вот этот шаг Вы можно делать сейчас без Верховной Рады.

Yes, this was said in Russian (and with mistakes in Russian).

§

Ми говоримо про повагу, Олег Валерійович, давайте триматися поваги.
↓ The first part sounds weird, but the last part is just wrong.
Ми говоримо про повагу, Олег Валерійович, давайте поважати один одного.

§

Я просто говорю як пропозицію.
↓ Grammatically incorrect
Я висуваю пропозицію.

§

Перепрошую, одну секундочку, тому що дійсно друге читання.
↓ Just guessing at the meaning, but it seems it should be something like:
Перепрошую, хвилиночку! Це дійсно друге читання.

It’s interesting that we got 9K wrong sentences.

@mkohler if all of them are confirmed to be wrong, yes, let’s delete them.

If you all could confirm that all pending sentences can be deleted, I will add a script that does that, with the following requirements:

  • All pending sentences up to Thursday November 7th, 18:30 CET (right now)
  • Not voted positively by user “Artem” (should I add any other user filter?)
  • Not fully approved yet

And then for the already approved sentences, this applies:

If you’re referring to these ones (a bit more than 1k), then yes, most of them are truncated or sound weird.

I think it’s because most of these sentences are taken from unedited transcripts of working sessions (without any proofreading or even formatting) rather than from prepared public speeches. See the attached file as an example (found by searching for the string “Чи немає необхідності в доповіді і обговоренні, колеги? Ніхто не наполягає?”).

From 16.05.2019.doc.pdf (720.8 KB)

Hi Sasha,

Thank you for your time and help reviewing this issue!

Hi @artem

I can share with you my experience with my community (Kabyle, a minority language) in collecting more sentences and recruiting more contributors.
It’s better to look for graduates of language departments, as I did. Onsite workshops and talks about Common Voice and the Sentence Collector can also help by giving people assistance. But social communication (pages, blogs, videos, …) and sometimes traditional media (TV, newspapers, radio…) are the best way to reach more people.

We are looking forward to more Ukrainian contributions.
:slight_smile:

@artem can you confirm that these are still the criteria? Should the date be extended to today?

With these requirements I’m getting 6592 records to delete. When extending the date to today I’m getting 6672, which is the total number of unreviewed sentences. Just to make sure: are there sentences not approved by “artem” that would still be valid?

Please extend the date to today.
For now, keep only my approvals.
Thanks :slight_smile:

Update: @mkohler please add user Oleg to the filter.
He just contacted me today. I’ll additionally review his sentences after the cleanup.

Thanks for your patience. This is now done, and I’ve also deleted all the sentences mentioned in https://github.com/mozilla/voice-web/pull/2499. However, as some sentences had been approved in the meantime, this resulted in the following diff:

Feel free to go through these and tell me which ones to delete. A text file with one sentence per line to delete would be perfect, as I have a script for that. For the new sentences, please also give a quick indication on why they should be deleted if needed.
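For anyone preparing such a file, a small sketch of how the newly added sentences could be listed by comparing two versions of the export. The helper name `newly_added` is hypothetical, and the inputs are assumed to be the lines of the older and newer sentence-collector.txt files.

```python
def newly_added(before_lines, after_lines):
    """Return sentences that appear in the newer export but not in the older one.

    Order follows the newer file; blank lines and repeats are skipped.
    """
    before = {line.strip() for line in before_lines if line.strip()}
    seen = set()
    added = []
    for line in after_lines:
        sentence = line.strip()
        if sentence and sentence not in before and sentence not in seen:
            seen.add(sentence)
            added.append(sentence)
    return added
```

The result can be reviewed by hand and the rejected sentences written out one per line, which is the format the deletion script expects.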

Got it. Thank you Michael for the agile support!