One of the things we are realizing when pulling sentences from big sources (like Wikipedia) is that they often contain complex or foreign words, which makes for a poor reading experience in the app.
To tackle this, we have been evaluating filtering sentences against known lists of the most common words in a given language, removing foreign words and the more complex words we don't use in day-to-day speech.
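A minimal sketch of what that filter could look like, assuming a toy common-word set (in practice this would be a per-language frequency list; the names and list here are hypothetical):

```python
# Toy vocabulary standing in for a real "most common words" list.
COMMON_WORDS = {"the", "cat", "sat", "on", "a", "mat", "dog", "ran"}

def is_simple(sentence: str, vocab: set) -> bool:
    """Keep a sentence only if every token is in the common-word list."""
    tokens = (t.strip(".,!?;:").lower() for t in sentence.split())
    return all(t in vocab for t in tokens)

sentences = ["The cat sat on a mat.", "The ocelot perched on a settee."]
kept = [s for s in sentences if is_simple(s, COMMON_WORDS)]
# Only the first sentence survives; "ocelot" and "settee" are filtered out.
```

A real version would also need per-language tokenization and a decision on how to handle numbers and proper nouns, which frequency lists usually don't cover well.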
The issue is that applying this filter greatly reduces the number of available sentences from certain sources, and we still need around 1.8M sentences for 2,000 hours.
An alternative has been using word vectors to generate new sentences from the existing ones, replacing one word with another that typically appears in the same context (with the same previous/next words).
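As a rough sketch of that substitution idea, assuming toy embeddings (in practice these would come from a trained model such as word2vec or fastText for the target language; all names and vectors below are made up for illustration):

```python
import math

# Toy word vectors; a real system would load pretrained embeddings.
VECTORS = {
    "happy": [0.90, 0.10, 0.00],
    "glad":  [0.85, 0.15, 0.05],
    "sad":   [-0.90, 0.20, 0.00],
    "table": [0.00, 0.90, 0.40],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(word):
    """Most similar word by cosine similarity, excluding the word itself."""
    v = VECTORS[word]
    return max((w for w in VECTORS if w != word),
               key=lambda w: cosine(v, VECTORS[w]))

def vary(sentence, target):
    """Produce a new sentence by swapping one word for its nearest neighbour."""
    return " ".join(nearest(t) if t == target else t
                    for t in sentence.split())

# vary("I am happy today", "happy") -> "I am glad today"
```

The open question with this approach is quality control: nearest neighbours in vector space can be antonyms or otherwise ungrammatical in context, so generated sentences would likely still need a review pass.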
I would like to open this conversation to get feedback from experts in the field who work with sentences in different languages. What would you consider ideal here? Are there other options we are not considering that would balance the sentences we can get against the ones we need?