Hi,
Over the past few months I’ve come across a large number of Italian sentences that are mostly made up of foreign proper nouns; in extreme cases a single foreign name makes up the whole entry. They all seem to come from this file: https://github.com/mozilla/voice-web/blob/master/server/data/it/wiki.it.txt
A few examples:
Donnie Boyce
Fred Vinson
By the Way
Robert Smith
Mickey Shaughnessy
L'intero task group venne insignito della Presidential Unit Citation per l'azione bellica
T'o-ur nde "Reino"!
Tó-ñe-moñang nde r-emi-motara yby-pe.
Frequentò la Syracuse University e la Columbia University.
For many of these, it’s clear from the recordings that people are confused and don’t know how to pronounce the foreign words.
Perhaps these cases could be removed by running language identification on wiki.it.txt. I also suspect many of these come from either Wikipedia section names or table entries, so modifying the extraction script to avoid those parts could be helpful, if it doesn’t do so already.
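
To illustrate the kind of filter I have in mind, here is a rough Python sketch. It is not real language identification, only a crude stand-in heuristic: a sentence is kept if it contains at least one common Italian function word and if not too many of its words (after the first) are capitalized. The function name `keep_sentence`, the word list and the 0.5 threshold are all my own assumptions for illustration; none of this comes from the actual extraction script.

```python
import re

# Small, hand-picked set of Italian function words (an assumption, not exhaustive).
# Ambiguous one-letter words such as "a", "i" and "o" are left out on purpose.
ITALIAN_FUNCTION_WORDS = {
    "il", "lo", "la", "gli", "le", "un", "una", "di", "da", "con", "per",
    "che", "non", "del", "della", "dei", "delle", "al", "alla", "nel",
    "nella", "e", "è", "si",
}


def keep_sentence(text, max_cap_ratio=0.5):
    """Return True if the sentence looks like ordinary Italian prose.

    Hypothetical heuristic: requires at least one Italian function word and
    rejects sentences where more than `max_cap_ratio` of the words after the
    first one start with an uppercase letter (likely proper nouns).
    """
    # Runs of letters only; apostrophes and hyphens act as separators.
    tokens = re.findall(r"[^\W\d_]+", text)
    if not tokens:
        return False

    has_function_word = any(t.lower() in ITALIAN_FUNCTION_WORDS for t in tokens)
    if not has_function_word:
        return False

    rest = tokens[1:]  # skip the sentence-initial word, which is always capitalized
    if not rest:
        return True
    capitalized = sum(1 for t in rest if t[0].isupper())
    return capitalized / len(rest) <= max_cap_ratio


if __name__ == "__main__":
    samples = [
        "Donnie Boyce",
        "By the Way",
        "Frequentò la Syracuse University e la Columbia University.",
        "Tó-ñe-moñang nde r-emi-motara yby-pe.",
    ]
    for s in samples:
        print(keep_sentence(s), s)
```

With these settings the bare names, the English title, and the two non-Italian sentences from my examples are all rejected, as is the Syracuse/Columbia sentence. The "Presidential Unit Citation" sentence still passes, because only a small share of its words are capitalized, so a heuristic like this would only catch the most extreme cases and a proper language identification tool would still be needed for the rest.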