Today @FremyCompany imported 60k sentences into the Dutch sentence collector, using transcribed speeches from the EU Parliament. The dataset covers speeches from 1996 to 2011 and is available in many European languages. You can read about the details here: http://www.statmt.org/europarl/
You can also read some more details about the Dutch dataset and how it was selected on Slack. The most important part is:
I only selected sentences in that dataset which:
- were between 5 and 8 words long
- were not longer than 55 characters
- started with an uppercase and ended with a dot
- did not contain parentheses, semicolons or colons
- did not have any capitalized word, with the exception of the one starting the sentence
The last restriction is rather strict but it makes the sentences way less topically biased and avoids having to deal with proper names for the most part, which I guess will make the review process smoother.
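For anyone who wants to try the same selection on another language, the five rules above could be sketched roughly like this in Python. This is not the script that was actually used for the Dutch import; the function names, the whitespace-based word splitting, and reading "capitalized" as "first character is uppercase" are all assumptions.

```python
# Rough sketch of the five filters quoted above (not the original script).
FORBIDDEN_CHARS = set("();:")  # parentheses, semicolons, colons


def is_candidate(sentence: str) -> bool:
    """Return True if a sentence passes all five rules."""
    sentence = sentence.strip()
    words = sentence.split()  # assumption: words are separated by whitespace

    # Rule 1: between 5 and 8 words long.
    if not 5 <= len(words) <= 8:
        return False
    # Rule 2: not longer than 55 characters.
    if len(sentence) > 55:
        return False
    # Rule 3: starts with an uppercase letter and ends with a dot.
    if not (sentence[0].isupper() and sentence.endswith(".")):
        return False
    # Rule 4: no parentheses, semicolons or colons.
    if any(ch in FORBIDDEN_CHARS for ch in sentence):
        return False
    # Rule 5: no capitalized word apart from the first one.
    if any(word[0].isupper() for word in words[1:]):
        return False
    return True


def filter_sentences(lines):
    """Keep only the lines that pass is_candidate, stripped of whitespace."""
    return [line.strip() for line in lines if is_candidate(line)]
```

Running `filter_sentences` over a file with one sentence per line (for example `filter_sentences(open(path, encoding="utf-8"))`, where `path` is whatever file you extracted from the corpus) would give the list of candidates to review.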
Given the desired sentence length (about 5 seconds), I would expect little variation across languages, so you should expect to find between 50k and 100k sentences for each of the 21 languages represented in the set.
The goal seems to be 1M sentences for each language, so this might get you to around 10% of the requirements for any of those languages.
There are however 2M sentences for Dutch, but most of these sentences are way too long. Finding a way to cut them into smaller chunks or training a language model on this dataset and generating new short sentences based on it might help get all the 21 languages across the line.
I experimented with the German sentences and after some filtering I now have a file containing more than a hundred thousand sentences ready to be used. Maybe the number will decline a little when I filter out more unsuitable sentences. But before I put all these sentences into the sentence collector I have some questions:
- Should we really check these sentences with the standard process? At least two people have to read every sentence, which would occupy the sentence collector for quite some time.
- How many sentences can be put into the sentence collector at once? Do I have to cut them into chunks?
- Do you guys think, in general, that using this data this way is a good idea?
I believe this could increase the diversity of the sentences in the big languages a lot. Right now they all have the same dry style from Wikipedia; with this data we could add more natural sentences from speeches in many languages. And this is also a great chance to increase the numbers for some of the smaller European languages that don’t have a big Wikipedia.