EDIT: Since this thread got bigger than expected, here is a summary of the project, which I will update from time to time. You can still find the old first post below.
This thread is about using a speech corpus from the EU Parliament. The dataset contains speeches from 1996-2011 and is available in 21 European languages. You can read about the details here: http://www.statmt.org/europarl/
To import the dataset you will have to filter the sentences and then run a review process to ensure the quality of the data. Here is an overview of the languages that have already done this and how they did it:
| Language | Merged | Number of sentences | Details |
|---|---|---|---|
| German | Yes | ±370,000 | German pull request, my collection of shell commands to filter the sentences, and the spreadsheets for the QA process |
| Danish | No | 5,136 | Danish pull request and the scripts for the Danish sentence extraction |
| Dutch | Yes | ±259,000 | Dutch pull request and the Dutch QA spreadsheet |
| Czech | Yes | 98,908 | Czech pull request and the QA spreadsheet |
| Polish | Yes | 205,024 | Polish pull request |
If you want to filter a new language, you should have a look at the CV Sentence Extractor and see if it already supports your language.
For the QA process, @nukeador sent these rules to me:
Manual review
After the automatic process, most major issues should be solved, but there are always a few that can't be fully automated. That's why we need to ensure the following on the filtered output from the previous process:
- The sentences must be spelled correctly.
- The sentences must be grammatically correct.
- The sentences must be speakable (also avoiding non-native uncommon words)
Then 2-3 native speakers (ideally some linguists) should review a random sample of sentences. The exact number depends on getting a valid sample size – you can use this tool with a 99% confidence level and a 2% margin of error. So, for 1,000,000 sentences, you would need a sample of roughly 4,000.
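For anyone wondering where that number comes from, here is a small sketch of the standard sample-size formula for a proportion (with finite population correction) that calculators like the linked tool use; the exact result may differ slightly depending on how the tool rounds:

```python
import math

def sample_size(population, z=2.576, margin=0.02, p=0.5):
    """Sample size for a proportion; z=2.576 corresponds to a 99% confidence level."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2          # infinite-population estimate
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite population correction

print(sample_size(1_000_000))  # -> 4131, i.e. roughly the 4,000 mentioned above
```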
Each reviewer should score each sentence on the dimensions above (points 1, 2 and 3), which will give you an indication of how many sentences pass the manual review.
You can use this spreadsheet template for a 100-sentence evaluation.
If the average error ratio across these reviewers is below 5-7%, you can consider it good to go.
Please report the following outputs to the Common Voice team:
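My understanding of how the error ratio would then be computed from the review spreadsheets, as a rough sketch (the data layout here is just an assumption for illustration):

```python
# For each reviewer, a list of booleans: True if the sentence failed any of the
# three checks (spelling, grammar, speakability), False if it passed all of them.
def average_error_ratio(reviews):
    per_reviewer = [sum(flags) / len(flags) for flags in reviews.values()]
    return sum(per_reviewer) / len(per_reviewer)

# Example: two reviewers, 100 sentences each, 3 and 5 errors respectively
ratio = average_error_ratio({"reviewer_a": [True] * 3 + [False] * 97,
                             "reviewer_b": [True] * 5 + [False] * 95})
print(f"{ratio:.1%}")  # 4.0% -> below the 5-7% threshold, so good to go
```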
- How many valid sentences did you get at the end?
- What is the error ratio?
- Who reviewed the sentences and a link to the review spreadsheet(s)?
Original post:
Hey everyone,
Today @FremyCompany imported 60k sentences into the Dutch Sentence Collector, using transcribed sentences from speeches from the EU Parliament. The dataset contains speeches from 1996-2011 and is available in many European languages. You can read about the details here: http://www.statmt.org/europarl/
You can also read some more details about the Dutch dataset and how it was selected on Slack. The most important part is:
I only selected sentences in that dataset which:
- were between 5 and 8 words long
- were not longer than 55 characters
- started with an uppercase letter and ended with a period
- did not contain parentheses, semicolons or colons
- did not contain any capitalized word except the one starting the sentence
The last restriction is rather strict, but it makes the sentences much less topically biased and avoids having to deal with proper names for the most part, which I guess will make the review process smoother.
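For anyone who wants to apply the same kind of filtering to another language, here is a rough Python sketch of the rules listed above. It is my own illustration, not @FremyCompany's actual script; the file name and the exact character set in the punctuation check are assumptions you would need to adapt:

```python
import re

def keep(sentence):
    """Rough re-implementation of the selection rules described above."""
    words = sentence.split()
    if not 5 <= len(words) <= 8:                  # between 5 and 8 words long
        return False
    if len(sentence) > 55:                        # not longer than 55 characters
        return False
    if not (sentence[0].isupper() and sentence.endswith(".")):
        return False                              # starts uppercase, ends with a period
    if re.search(r"[();:]", sentence):            # no parentheses, semicolons or colons
        return False
    if any(w[0].isupper() for w in words[1:]):    # no capitalized word after the first
        return False
    return True

# "europarl-v7.nl.txt" is a placeholder for the per-language Europarl text file
with open("europarl-v7.nl.txt", encoding="utf-8") as src:
    kept = [line.strip() for line in src if keep(line.strip())]
```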
Given the desired sentence length (about 5 seconds), I would expect little variation across languages, so you should expect to find between 50k and 100k sentences for each of the 21 languages represented in the set.
The goal seems to be 1M sentences per language, so this might get you to around 10% of the requirement for any of those languages.
There are, however, 2M sentences for Dutch, but most of them are way too long. Finding a way to cut them into smaller chunks, or training a language model on this dataset and generating new short sentences with it, might help get all 21 languages across the line.
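To make the "cut them into smaller chunks" idea a bit more concrete: one naive approach would be to split the long sentences at clause boundaries and keep only fragments that fit the length limits. A rough, untested sketch (the boundary characters and limits are just assumptions):

```python
import re

def split_into_chunks(sentence, min_words=5, max_words=8, max_chars=55):
    """Naively split a long sentence at commas/semicolons/colons and keep short fragments."""
    chunks = []
    for frag in re.split(r"[,;:]", sentence):
        frag = frag.strip().rstrip(".")
        words = frag.split()
        if min_words <= len(words) <= max_words and len(frag) + 1 <= max_chars:
            # re-capitalize and terminate so the fragment reads like a standalone sentence
            chunks.append(frag[0].upper() + frag[1:] + ".")
    return chunks
```

Any fragments produced this way would of course still have to go through the same manual review, since clause fragments are often not grammatical sentences on their own.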
I experimented with the German sentences and, after some filtering, I now have a file containing more than a hundred thousand sentences ready to be used. The number may decline a little once I filter out more unsuitable sentences. But before I put all these sentences into the Sentence Collector, I have some questions:
- Should we really check these sentences with the standard process? At least two people have to read every sentence, so this would occupy the Sentence Collector for quite some time.
- How many sentences can be put into the Sentence Collector at once? Do I have to cut them into chunks?
- Do you generally think that using this data in this way is a good idea?
I believe this could greatly increase the diversity of the sentences in the big languages. Right now they all have the same dry style from Wikipedia; with this data we could add more natural sentences from speeches in many languages. It is also a great chance to increase the numbers for some of the smaller European languages that don't have a big Wikipedia.