@dabinat @Michael_Maggs One thing we should start figuring out is whether there is any other large source of English text online that we can ask our legal team to review, in case we can do a “fair-use” extraction similar to the one we are doing with Wikipedia, ideally from a source that gives us less boring and more diverse content.
I have some concerns about how the Wikipedia import was handled that hopefully we can avoid next time.
The wiki dump suddenly landed without warning and contributors like myself had to spend time cleaning it up after the fact. There should have been time for the community to review it before it went live.
Although we will eventually need millions of sentences, we don’t need millions right now, so I think a series of smaller imports over time would have been more manageable.
One thing that needs to be thought about is how to import new sources and have them actually show up for users to read. If sentences are chosen at random and the wiki dump has 1.4 million sentences, statistically most will come from there. So perhaps the wiki dump needs to be shortened? Maybe limit it to 20,000, store the remaining sentences elsewhere and periodically top it up as needed?
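To make the "limit it and top it up" idea concrete, here is a minimal sketch in Python. It assumes the sentences are simply held as lists in memory; the function name `top_up` and the 20,000 target are illustrative and not part of any existing Common Voice tooling.

```python
import random

def top_up(active, reserve, target, rng=None):
    """Move randomly chosen sentences from `reserve` into `active`
    until `active` holds `target` sentences (or the reserve runs out).

    Returns the new (active, reserve) pair; the inputs are not modified.
    """
    rng = rng or random.Random()
    # How many sentences are missing from the active pool, capped by what is left.
    needed = min(max(0, target - len(active)), len(reserve))
    # Pick random positions in the reserve so the batch is not biased by file order.
    picked = set(rng.sample(range(len(reserve)), needed))
    moved = [s for i, s in enumerate(reserve) if i in picked]
    remaining = [s for i, s in enumerate(reserve) if i not in picked]
    return active + moved, remaining
```

Something like `active, reserve = top_up(active, reserve, 20000)` could then be run periodically, so the wiki sentences in circulation never outnumber the other sources by more than the chosen cap.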
Just wanted to add that even though my script has filtered out around 100k sentences so far, there are still a lot of errors remaining.
The biggest problem right now is partial sentences. Because the wiki script filtered by punctuation, you end up with a lot of sentences like:
I will not be silenced, Mr.
He married Josephine Brock a.k.a.
Upgrades to Hwy.
And the import script seems to have a bug that removed certain numbers so you end up with nonsensical sentences like:
The island is around long and around wide.
The interior room measures approximately high by wide by deep.
Aust Cliff, above the Severn, is located about from the village.
These have proven not too difficult to remove. The ones I’m having difficulty with are the ambiguous ones:
Christina Aguilera feat.
I don’t know how to distinguish between “featuring” and “no mean feat”.
In all, A.
It’s difficult to tell the difference between a truncated sentence ending in a letter and a correct sentence ending in a letter such as:
Its dominical letter hence is A.
All international carriers operate out of Terminal A.
Another problem is the word “I”. Depending on the context, “I” can mean “myself”, “the First” (as in James I) or “one” (as in Schedule I) - or indeed simply the letter I.
I don’t think these problems are fully solvable with a script (unless it somehow understood context) so it should be accepted that any future imports will have errors that only human review could fix. So I think that human review should be part of the process, even if not every single sentence undergoes manual review.
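For the two easier categories above (sentences cut off at an abbreviation, and sentences with a number stripped out), a minimal Python sketch of the kind of check involved might look like the following. The abbreviation list and the regular expression are illustrative assumptions, not the actual clean-up script, and both will over-flag: "He was not around long." would be caught by the second check, for example. Flagged sentences are best treated as candidates for human review.

```python
import re

# Abbreviations that rarely end a genuine sentence (illustrative, not exhaustive).
# "feat." and single letters such as "A." are deliberately left out because
# they are the ambiguous cases discussed above.
TRAILING_ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "hwy.", "a.k.a."}

# A quantity cue followed directly by a dimension word (or "from")
# suggests that a number and its unit were removed.
MISSING_NUMBER = re.compile(
    r"\b(?:around|about|approximately|by)\s+(?:long|wide|high|deep|tall|from)\b",
    re.IGNORECASE,
)

def looks_truncated(sentence):
    """True if the sentence ends in an abbreviation from the list above."""
    words = sentence.strip().split()
    return bool(words) and words[-1].lower() in TRAILING_ABBREVIATIONS

def looks_missing_number(sentence):
    """True if a measurement phrase appears to have lost its number."""
    return MISSING_NUMBER.search(sentence) is not None
```

Neither function flags "Christina Aguilera feat." or "In all, A.", which is exactly the gap that would need context, or a human, to close.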
For resource-rich languages such as English, there are sentence tokenizers, e.g. edu.stanford.nlp.process.DocumentPreprocessor, that can solve this problem before the import.
For resource-poor languages, people creating and curating valid sentences is the only real solution I see.
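As a rough illustration of what a tokenizer does differently from a plain split on punctuation, here is a toy stand-in in Python. It is not the Stanford tool and handles far fewer cases; the abbreviation list is an assumption made up for the example.

```python
# Tokens that end in a full stop but usually do not end a sentence
# (a tiny illustrative list; real tokenizers cover far more cases).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "hwy.", "a.k.a.", "e.g.", "i.e.", "feat."}

def split_sentences(text):
    """Split text into sentences, refusing to break after a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_like_sentence = token.endswith((".", "!", "?"))
        if ends_like_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that never reached sentence-final punctuation.
        sentences.append(" ".join(current))
    return sentences
```

With this, "I will not be silenced, Mr. Speaker." stays in one piece instead of being cut after "Mr.", which is the failure seen in the Wikipedia extraction.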
@dabinat, OK, I’ve added some more PD books. Ready to review.
I ran into the sentence
Baruch ata Adonai, m’sameiach chatan im hakalah.
while validating voice recordings in English. The next sentence to validate was in English as expected. Is this a recording for another language that shouldn’t have been played for me or was the English corpus tainted?
Hi @moonhouse, welcome! That sentence must have come from the big English Wikipedia import, as it appears on this page: https://en.wikipedia.org/wiki/Sheva_Brachot. Unfortunately, the sentences that were extracted from Wikipedia were not subject to the same quality-control as those that have come via the Sentence Collector and the error rate is quite high. @dabinat has been doing a lot of clean-up work but much more remains to be done. A ‘report-error’ button for readers has been requested but has not been implemented yet.
@Michael_Maggs I do wonder if it’s a lot easier to just put the Wikipedia sentences through Sentence Collector. Obviously it would take a long time to go through 1.4m sentences but we don’t need all 1.4m right now so it could be done in stages.
We want to avoid this option. Our experience has shown that manual review is very time-consuming and that people don’t engage in this activity. Doing this in the past has blocked us for months in most languages, which is why we are investigating alternative ways to collect large sets of sentences.
The current thinking is to identify the problems that importing a large set of sentences causes and build mechanisms to fix them automatically, while also letting people flag any corner cases directly in the app.
I think the scale is the issue here. With 1.4m sentences a “corner case” could amount to 30,000 sentences. That seems like an awful lot of sentences to expect people to flag up, especially bearing in mind that most people still record anyway, even if it’s in a foreign language. And that’s probably a conservative number because there are a lot of bad sentences making it through and they’re supposed to be random, so that would suggest the number of bad sentences may still be pretty high. All my script has done is tackle the low-hanging fruit.
We don’t need all 1.4m sentences right now, so I don’t see the slower pace of Sentence Collector as necessarily a bottleneck. (Plus it only took me a couple of days to go through Michael’s 4000(?) submissions.) But some manual work will definitely be required, whether that is through Sentence Collector or someone manually editing the file and creating PRs. But manually editing a 1.4m line file is pretty daunting so that’s why I think it should be broken up into multiple files - maybe 20,000 each?
I also want to add that I think the biggest reason Sentence Collector isn’t well used isn’t because people don’t want to submit or validate sentences but that it’s not well advertised. It does not use a Mozilla URL, which makes it unclear whether it’s officially connected to the project, and the link is hidden away in the FAQ. IMO it should be prominent on the homepage - Speak, Listen and Write.
That would be an acceptable number, since it’s just 2.1% of the total. If you run into a problematic sentence twice in every 100 clips you record, that’s not a really bad experience.
We believe that 100% perfect sentences is not something we can expect for the scale we are looking for, so we need to make the minimum compromise to ensure a good user experience and proper model training.
For languages like English and Mandarin, we really need 1.8M because we want to accelerate and reach 2K hours of validated voice in the coming months. The project really needs a few languages with a minimum viable dataset and trained models so that people can start using the technology.
About this “acceleration”: We are working with communities but also with organizations and companies that are really looking forward to contributing to this acceleration, so we need to be prepared and have a buffer of sentences big enough to accommodate these new streams of contributions.
30,000 is just a number I threw out there. I don’t know the real number, but it’s certainly not 2% of clips I validate - I’d estimate it’s probably more like 30-40%.
By the way, I find bad sentences through voice validation, so I only know about ones that have already been recorded. There’s no feasible way for me to scroll through a 1.4m line file to find bad sentences manually. So I’m probably just scratching the surface.
Anything above 10% is a problem. If that’s the case, we definitely need to flag it and make sure we clean them out. Is there a way we can have that percentage confirmed? Do you have a small sample where this is happening?
The random reviews we did before the import were way below 10%.
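One way to confirm a percentage like this is to have someone review a random sample and put a confidence interval around the observed rate. The sketch below, in Python, uses the standard Wilson score interval; the function names, the sample size and the counts in the note that follows are hypothetical, not measurements from the corpus.

```python
import math
import random

def sample_sentences(sentences, n, seed=0):
    """Draw up to n sentences uniformly at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))

def wilson_interval(bad, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    `bad` is the number of reviewed sentences judged bad, `n` the sample size.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = bad / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

As a made-up example, if a reviewer marked 35 of 100 sampled sentences as bad, `wilson_interval(35, 100)` would give roughly 26% to 45%, clearly above the 10% threshold, whereas 5 bad out of 100 would give an interval that stays below about 12%.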
Can any other validators estimate what proportion of bad sentences they’re seeing? @Michael_Maggs
Yes, I also have difficulty finding bad sentences just by random scrolling. I don’t think it’s an effective way to do it. That’s what I mean by the scale being a problem and it would be more manageable if the wiki import was split up into multiple files.
We should distinguish between bad sentences and those that are too difficult.
Bad sentences are those that are wrong in some way. They may have grammatical errors, or features such as numbers or abbreviations which are not allowed. Too difficult sentences are those that our volunteers can’t read, and which as a result fail validation.
The Wikipedia dataset originally had a high proportion of bad sentences, but @dabinat’s scripts have reduced those now to reasonable levels (around 5% I’d guess). There is still a huge problem, though, with too difficult sentences, and I’m finding a current failure rate of 30-40%. That’s far higher than the base failure rate for Gutenberg sentences, of around 20%.
The main reasons I see for the high proportion of failures are:
Non-English proper names, especially of geographic locations in non English-speaking countries. Also, non-English names of people.
Difficult technical terminology, especially scientific words such as binomial Latin species names.
Complex sentence constructions requiring high levels of reading skill, particularly long sentences. Far more 14-word sentences are read wrongly than 10-word sentences.
Regardless of how difficult a sentence is, readers will try to tackle it. They don’t seem to know about or use the skip button.
A big issue is that since the WP sentences were uploaded the volunteer base has changed, and there are far fewer native English speakers now. We are strongly attracting English language learners - which is great for diversity, of course, but not good for getting standard pronunciations of words which require higher knowledge of English than many volunteers have. That somehow needs to be tackled (along with getting more women).
The 30-40% failure rate for WP sentences is I think a real problem. Not only are we driving volunteers away with sentences they can’t handle, we are wasting their time (and the time of two validation volunteers) by getting them to read anyway and then throwing away their efforts.
There are several things I’d like to discuss to try to deal with the too difficult sentences:
Withdraw all unread WP sentences, filter them more rigorously, as below, and put them back in batches of say 20,000 at a time so that they don’t swamp the more interesting Gutenberg book sentences.
Remove all WP sentences that include words that can’t be found in one of the bigger online English word lists. That will deal with the issue of non-English proper names and Latin scientific names, at the expense of some minor reduction in word diversity. I doubt the reduction would be significant, since Wikipedia is unlikely to use many otherwise-unknown words.
The complexity of the sentences we present for reading has to be matched to our volunteers. I would consider rejecting sentences with words that are very uncommon, when measured against publicly-available word frequency tables. Again, the reduction in word diversity won’t be significant since such words are tripping up readers in any event and are already being rejected on that basis.
Reduce the maximum allowable sentence length to 12 or even 10 words.
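The three filters suggested above (known words only, no very rare words, a cap on length) could be combined along the following lines. This is a minimal Python sketch under stated assumptions: `known_words` stands for whichever English word list is chosen, `word_counts` for a word frequency table, and the `min_count` threshold is an arbitrary placeholder that would need tuning against real data.

```python
import re

# Lower-case words, allowing internal apostrophes ("don't").
WORD_RE = re.compile(r"[a-z']+")

def tokenize(sentence):
    """Lower-case the sentence and return its words, normalising curly apostrophes."""
    return WORD_RE.findall(sentence.lower().replace("\u2019", "'"))

def is_readable(sentence, known_words, word_counts, max_words=12, min_count=5):
    """Apply the three proposed filters to a single sentence."""
    words = tokenize(sentence)
    # Filter 1: sentence length.
    if not words or len(words) > max_words:
        return False
    # Filter 2: every word must appear in the reference word list.
    if any(w not in known_words for w in words):
        return False
    # Filter 3: every word must be reasonably common in the frequency table.
    return all(word_counts.get(w, 0) >= min_count for w in words)
```

Running this over the unread WP sentences before each batch goes back in would also give a quick count of how many sentences each filter removes, which speaks to the concern about losing too much word diversity.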
Finally, the whole question of the Sentence Collector needs more discussion. Its future usefulness will depend on how much programming effort the team is able to put in, how well it is advertised as an integral part of the CV project, and the availability of volunteers to do this more technical (and perhaps more boring?) work. What’s clear is that it can’t be used in its present state to validate many millions of sentences. If you are considering ditching it entirely, though, the lessons of the WP upload are that you’re going to need far more strict filtering of sentences to ensure reasonable levels of both accuracy and readability.
I can easily provide a million sentences from Gutenberg novels, but unless they can be dealt with efficiently in the same way as the WP or other bulk uploads made by the CV team, the Sentence Collector will act as an absolute block to them getting into the database.
Thanks for this @Michael_Maggs let me circulate all this feedback with the team and do a quick analysis on the evolution of rejected clips in English since we launched the new sentences.
I agree with all of @Michael_Maggs’s points and would certainly support his Gutenberg sentences bypassing Sentence Collector validation. They tend to be high quality and I vote down only a tiny fraction (way less than 1%).
(By the way, I have no idea how many of your recent submissions have been approved because they require two votes, but my own review queue is empty so feel free to submit more.)
I like the idea of importing from Wikipedia as the topics are truly random and it gets us words for modern technology like “laptop” and “cellphone” that aren’t in old books, as well as brand names and proper nouns.
But as I have said before, I think the Wikipedia import was badly handled. The community should have been given time to vet the sentences and import script and give feedback. My script is doing what the original import script should have done, but in a worse way because those sentences are removed completely instead of replaced with another random sentence. There are also bugs in the original script that cause random words to be erased and produce nonsensical sentences that I have to then try to detect.
(The reason I refrained from Michael’s suggestion of shortening sentence length was that I feared too many would be stripped out. The original script doesn’t have to worry about this as it can just replace those sentences with something else.)
So I think the best thing to do is keep whatever has already been approved and discard the rest, then work on improving the import script and importing higher quality sentences in batches of 20,000 or so.
As a clarification, we did gather community feedback on the script export before launching, and the error rate was well below 10%, which is why we went ahead with the import. Having said that, it is true that when we imported 1.2M sentences the error rate grew more than expected.
What I want to make sure of is that we still have enough sentences imported to cover the voice demand (we have an ongoing campaign) and that people don’t record the same sentence twice. I don’t know whether that means batches of 20K or some other figure.
I’ve passed this on to the team and we will meet on Monday to agree on an action based on all your feedback, which is tremendously useful. It’s shaping our thinking and helping us improve, thanks!