How to report bad sentences that made it past the sentence collector?

While validating content for Italian, I have come across a lot of low-quality sentences that seem to have made it past the sentence collector.

Should I hit No on those, or just skip them? I imagine it would be better if they could be taken out of circulation entirely, so that users don’t waste their time speaking them.

Some examples:

  • Lots of foreign technical terms, some of them highly specific to software development (did someone upload technical documentation?). Even for more common terms, like open source, I’ve noticed that about half the speakers say /sɔɹs/ as if it were English, and half say /surs/, mimicking French. Surely this can’t be good for the dataset?
  • Lots of foreign first and last names. For some, e.g. Bob or Obama, the pronunciation will be obvious to everyone (?). Others, like Schoenberner or Veheran, get mangled in various inconsistent ways.
  • Several sentences appear to be taken from a fantasy novel. They contain what must be made-up names, which look nothing like Italian words and whose pronunciation isn’t at all obvious, like Zipak (/'dzipak/? /tsi'pak/? /'zaipek/? no clue) or Bonard (pronounce it like English? French?).
  • Very repetitive sentences. I must have come across at least 10 sentences from the same novel, all involving someone called De Vincenzi.

Can I give someone a list of these sentences I come across?


Hi @Jean

Thanks for sharing this issue.

Can you paste the full sentences here so we can check where they are coming from?

Thanks!

Hi @nukeador! I’ll start collecting the odd sentences I see, sure. In fact, it will just be faster if I look at the dataset directly, rather than collect the sentences as I come across them.

I’ve downloaded the dataset (it_40h_2019-06-12) and have extracted all the sentences via

for i in *.tsv; do cut -f3 "$i" >> sentences.txt; done  # column 3 holds the sentence text

(Is this a good way of going about it, or is there a more up-to-date list of the sentences elsewhere?)

That ends up being 16k+ sentences after deduplication, which I don’t have time to inspect manually. So I’m thinking of narrowing the list down by running language identification, to find potentially problematic words that are not Italian. I can also look for things like weird punctuation, and use a spellchecker to catch typos and the like.
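For the automated part, I’m picturing something along these lines (just a rough sketch, assuming sentences.txt has one sentence per line; the langid package and the list of “odd” characters are arbitrary choices of mine, nothing official):

# Rough sketch: flag sentences that don't look Italian or contain odd characters.
# Assumes sentences.txt (one sentence per line) and the third-party langid package
# (an arbitrary choice; any language-identification library would do).
import re
import langid

ODD_CHARS = re.compile(r"[<>@#$%^&*_|~{}\[\]\\]")  # characters I wouldn't expect in ordinary prose

with open("sentences.txt", encoding="utf-8") as f:
    sentences = sorted({line.strip() for line in f if line.strip()})  # dedupe

for s in sentences:
    lang, score = langid.classify(s)  # best-guess language code plus a score
    if lang != "it" or ODD_CHARS.search(s):
        print(f"{lang}\t{s}")  # keep flagged sentences for later manual review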

I can then manually inspect the potentially bad sentences identified by the above procedure, and provide you with a list of the ones that I feel are truly problematic.

Would this kind of work be useful? (I imagine the automated part can be easily applied to other languages too, not just Italian)

All validated sentences for Italian are hosted here:

There is also metadata for the sentences that came through the Sentence Collector, so you can see who submitted them and who reviewed them.

Cheers.


I’ve done a first pass and collected the sentences from the Italian dataset that I find questionable. In my opinion, these should be removed. You can find them here: https://gist.github.com/jeanm/62479bebc304f414c0e6b0364186db25

This was done semi-automatically, using spellcheck and language identification followed by manual review.