Balancing most common words vs. number of sentences

Hi all,

One of the things we are realizing when getting sentences from big sources (like Wikipedia) is that, in general, they contain complex (or foreign) words that are not optimal for the user experience when reading on the app.

To tackle this issue we have been evaluating filtering sentences against known lists of the most common words in a specific language, removing all foreign words and reducing the complex words we don’t use in day-to-day speech.

The issue is that applying this greatly reduces the number of available sentences from certain sources, and we still need around 1.8M sentences for 2,000 hours.

An alternative here has been using word vectors to generate new sentences from the existing ones, replacing one word with another that usually appears together with the previous/next one.
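As a rough sketch of that vector-replacement idea: the three-dimensional vectors and words below are toy stand-ins (a real setup would use embeddings such as fastText or word2vec trained on the target language), but the mechanics are the same.

```python
import math

# Toy word vectors; purely illustrative. In practice these would come
# from embeddings trained on a large corpus in the target language.
VECTORS = {
    "house": [0.9, 0.1, 0.0],
    "home":  [0.8, 0.2, 0.1],
    "cat":   [0.0, 0.9, 0.3],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbour(word):
    """Most similar other word in the vector table, or None if unknown."""
    if word not in VECTORS:
        return None
    candidates = ((cosine(VECTORS[word], v), w)
                  for w, v in VECTORS.items() if w != word)
    return max(candidates)[1]

def vary_sentence(sentence):
    """Generate variants by swapping one word at a time for its neighbour."""
    words = sentence.split()
    variants = []
    for i, w in enumerate(words):
        swap = nearest_neighbour(w.lower())
        if swap:
            variants.append(" ".join(words[:i] + [swap] + words[i + 1:]))
    return variants
```

One caveat with this approach: nearest neighbours in embedding space are not always grammatical substitutes, so generated sentences would still need a validation pass.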

I would like to open this conversation to get feedback from experts in the field, working with sentences in different languages. What would you consider ideal here? Are there other options we are not considering to balance the sentences we can get vs the ones we need?



One simple idea would be to take a corpus of “everyday” language (or whatever your target domain is), build a frequency list, and exclude any Wikipedia sentence that includes words below a certain rank. I’m not sure how well it would work in practice, but it would certainly be a reasonable baseline.

Foreign doesn’t always mean unpronounceable and some non-English terms are in common use among English speakers (e.g. a la carte, coup d’etat), so removing all of these may not be the best approach. Plus I think it’s good to have coverage of major country / city names.

A simple way of doing this would be to filter all foreign letter sequences and then have a whitelist for common words / phrases that might otherwise be stripped out.
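A minimal sketch of that filter, assuming “foreign letter sequence” means any word containing letters outside the basic English alphabet; the whitelist entries here are illustrative examples, not a real list.

```python
import re

# Words built only from English letters (plus apostrophes/hyphens)
# are considered native; anything else is a "foreign letter sequence".
ENGLISH_WORD = re.compile(r"^[A-Za-z'’-]+$")

# Illustrative whitelist of borrowed words/phrases in common use.
WHITELIST = {"a la carte", "coup d'etat", "café"}

def has_foreign_word(sentence):
    """True if the sentence contains a non-whitelisted foreign word."""
    lowered = sentence.lower()
    # Strip whitelisted phrases first so they don't trigger the filter.
    for phrase in WHITELIST:
        lowered = lowered.replace(phrase, "")
    words = re.findall(r"[^\W\d_]+", lowered)  # runs of letters
    return any(not ENGLISH_WORD.match(w) for w in words)
```

Sentences for which `has_foreign_word` returns True would then be rejected (or flagged for review).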

However, what I’ve found when validating is that the best indicator of whether a user will get a sentence right is the overall complexity of the entire sentence, not necessarily the complexity of a specific word. So users do better on a sentence with one foreign word than they do with two, and the success rate for three or more seems to be pretty close to zero. (The same also applies to long words.)

So some kind of sentence scoring system would be better IMO. It could score sentences based on the number of words overall, the number of foreign letter sequences and the number of long words, especially the length of words containing foreign letter sequences. Such a system would be more forgiving of words like “Hawaii” which don’t obey traditional English rules but are in common use and easy for English speakers to pronounce.
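A hedged sketch of such a scoring system; the weights, the 12-character length limit, and the acceptance threshold below are arbitrary placeholders that would need tuning against real validation data.

```python
import re

# Any non-ASCII character counts as "foreign" for this rough sketch.
FOREIGN_CHARS = re.compile(r"[^\x00-\x7f]")

def sentence_score(sentence, max_word_len=12):
    """Lower is easier. A rough composite of the signals described above:
    overall word count, foreign letter sequences, long words, and the
    length of the foreign words themselves."""
    words = sentence.split()
    foreign = [w for w in words if FOREIGN_CHARS.search(w)]
    long_words = [w for w in words if len(w) > max_word_len]
    score = len(words)
    score += 5 * len(foreign)               # foreign words weigh heavily
    score += 3 * len(long_words)            # long words weigh somewhat
    score += sum(len(w) for w in foreign)   # long foreign words worst of all
    return score

def accept(sentence, threshold=25):
    return sentence_score(sentence) <= threshold
```

Note that “Hawaii” is all-ASCII, so it passes through this particular foreign-character heuristic untouched, which matches the intent above.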

(As a side note, I could have decreased the max length of a word on my validation script but chose not to because I felt it stripped out too many perfectly pronounceable words - a lot of which hit the length limit because they were extended by “ing” or “ed”. So maybe a word length algorithm should take into account the length before common suffixes.)
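That suffix-aware length check could look like this; the suffix list and the minimum stem length of 3 are illustrative assumptions for English only.

```python
# Illustrative English suffixes; a real list would be per-language.
COMMON_SUFFIXES = ("ness", "ing", "ed", "ly", "s")

def effective_length(word):
    """Word length measured before common inflectional suffixes, so that
    e.g. 'understanding' is scored by the length of 'understand'."""
    lowered = word.lower()
    for suffix in sorted(COMMON_SUFFIXES, key=len, reverse=True):
        # Only strip if a plausible stem (>= 3 letters) remains.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return len(lowered) - len(suffix)
    return len(lowered)

def too_long(word, limit=12):
    return effective_length(word) > limit
```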

Is there a methodology to identify this per language or this is something we’ll have to get from an external source?

Another recommendation coming from Deep Speech team:

  1. Take Wikipedia text in language X
  2. Find all the unique words
  3. Sort them from the most frequently used to least frequently used.
  4. Take the top N words.
  5. Create a corpus from Wikipedia, rejecting sentences that contain words outside the top N.
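The steps above can be sketched as follows; the sentence list stands in for the real Wikipedia dump, and tie-breaking at rank N is left to `Counter.most_common`.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase runs of letters (step 2's notion of 'word')."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_vocabulary(corpus_sentences, top_n):
    """Steps 2-4: unique words, sorted by frequency, top N kept."""
    counts = Counter(w for s in corpus_sentences for w in tokenize(s))
    return {w for w, _ in counts.most_common(top_n)}

def filter_corpus(corpus_sentences, top_n):
    """Step 5: reject sentences containing out-of-vocabulary words."""
    vocab = build_vocabulary(corpus_sentences, top_n)
    return [s for s in corpus_sentences
            if all(w in vocab for w in tokenize(s))]
```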

Derivation and inflection in agglutinative languages can produce many unique word forms within the same language, and this is the case for Afroasiatic (Hamito-Semitic) languages: Kabyle and other Berber languages, Hebrew, Arabic, Amharic…

In such a situation we need more processing, such as lemmatization/stemming; it takes more time and resources to process huge corpora, and adapted algorithms are needed for each language.

I was exploring a solution using n-grams (bigrams and trigrams) with NLTK.
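For reference, NLTK exposes this via `nltk.util.ngrams` (plus the `bigrams`/`trigrams` helpers); a dependency-free equivalent of what those produce:

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples (matching what
    nltk.util.ngrams yields)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)    # 5 windows of 2 tokens
trigrams = ngrams(tokens, 3)   # 4 windows of 3 tokens
```

Frequencies over these windows could then flag sentences whose word combinations never occur elsewhere in the corpus.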


One of the things I was observing for Spanish is that complex or overly technical words usually appear in just one sentence. I suspect it’s safe to remove sentences with unique words as an initial clean-up step.

I’ll try with the kab corpus on Wikipedia and analyse the result. The Kabyle Wikipedia is still sparse.



As I understand it, one of the problems you have is that when you filter an extracted Wikipedia corpus to remove sentences with rare/difficult/foreign words you find that you have to throw away a significant proportion of the data. And you can’t then easily get more because of the limit of three sentences per article.

Are the Wiki extractions done separately from the validity-checking? Maybe part is done by the Deep Speech team and part by the Common Voice team?

A better methodology would be to add the rare/difficult/foreign words filter to the code used for the primary extraction. If one of the three random sentences from a specific article fails, simply choose a replacement there and then. You then end up with a more-or-less maximal collection of sentences that you know in advance will be useful.

No, the process is currently run just by the Common Voice team, and we can include new rules in the extraction. It’s only that for quality testing it’s easier to filter the already extracted sentences, because generating a new extraction takes around 6 hours on my box.

Improving and understanding the right filtering will allow us to implement it in the main extraction and hopefully get more sentences.