Common voice sentences are the opposite of "common"

If you want to avoid a middle class skew, it’d help a lot if the sentences weren’t sophisticated wordy guff.

I was working at Microsoft when they were building their data set and donated my voice. Most of their phrases looked to be lifted from MSN Messenger or similar. They were chatty phrases full of abbreviations and slang, rather than being wordy and eloquent.

I tried to get my dad and my brother to record a few phrases for common voice, but them being actual commoners they had trouble reading and pronouncing the given phrases. It’s just not the sort of thing they’d even listen to, let alone speak, and certainly not read!

I mean, people who find it easier to dictate than to type are a thousand times more likely to use the word “bollocks” than “municipality”.

Maybe scraping a load of phrases from IRC and pop culture pages on simple.wikipedia would help address this?


Hi @david-song welcome to the community!

One of the main challenges of this project is to have a public-domain text dataset big enough to support the thousands of unique hours of speech we need for a solid dataset.

The most successful approach so far has been the wikipedia extraction (where we still need help), which got us 2M+ sentences to read.

If you happen to know another big source of sentences with a public-domain license we can use, it would be great to plan our next steps towards evolving the wikipedia-extractor tool to also extract and clean up sentences from other sources.

Thanks for your feedback!

Thanks.

Just thinking out loud here, but I think that as much data as possible should really be collected from real-world usage in the sort of domains that it’ll actually be used in. So probably messaging, search and commenting, in that order.

It ought to be possible to use a browser plugin plus WhatsApp Web to scrape messages that we ourselves have sent (thus likely own the copyright to), then select a subset that are a) actually our own work and b) contain no personal information. Then split those into sentences and shuf 'em.
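The last step, splitting the kept messages into sentences and shuffling them, might look something like this minimal Python sketch (the regex split is deliberately naive, and the function names are mine, not an existing tool):

```python
import random
import re

def split_sentences(messages):
    """Naively split chat messages into individual sentences.

    A crude split on sentence-ending punctuation; real chat text
    would need something smarter, but it illustrates the idea.
    """
    sentences = []
    for msg in messages:
        for part in re.split(r"(?<=[.!?])\s+", msg.strip()):
            part = part.strip()
            if part:
                sentences.append(part)
    return sentences

def shuffled_sentences(messages, seed=None):
    """Split messages into sentences and shuffle them, destroying
    the original conversation order (and with it, context)."""
    sentences = split_sentences(messages)
    random.Random(seed).shuffle(sentences)
    return sentences
```

Shuffling matters here: it decouples each sentence from the conversation it came from, which is itself a small privacy win.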

Similarly, Facebook status updates, comments and tweets from donors would be a good source. We could do something similar with searches, using web history as a source.

Finally, nagging users on IRC to opt in, then using channel logs as a source and extracting just the text from those specific users. Obviously only in chatty channels rather than technical ones.

Those are great ideas; I wonder how to balance privacy there. How many people will feel comfortable sharing that? I don’t really know, but I definitely won’t :stuck_out_tongue:

Do you know any existing tools for this kind of extraction, or of others already doing it, so we can learn from their experience?

Thanks!

I’ve not got my laptop on me at the moment so can’t look at the WhatsApp Web thing, but I’ve done a fair bit of web automation in Selenium and suspect it’d be reasonably easy because it’s largely stable and pretty much unmaintained.

This is the “brute force and ignorance” method of how I’d do it if I couldn’t work out how to snag the JSON data directly from the phone:

Visit web.whatsapp.com, scan the QR code with your phone, and it sets up an ipv6/peer.js connection directly to your phone. There’s a list of conversations on the left; you’d extract those using a CSS or XPath selector on the browser’s DOM, then loop over them by clicking on them in turn. For each one, you’d then get the height of the message panel on the right side, scroll up as far as you can, and see if the size has changed. Once it’s not got any bigger for, say, 10 seconds, you’d use a CSS or XPath selector to get all the green message bubbles and copy them into an empty doc. Once all the conversations are done, let the user select the ones they want to share.
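The “scroll until the panel stops growing” loop is the fiddly bit. Here’s a rough, browser-agnostic Python sketch: `get_height` and `scroll_up` would be thin wrappers around Selenium calls (both names are mine), and injecting the clock keeps the loop testable without a browser:

```python
import time

def scroll_until_stable(get_height, scroll_up, stable_for=10.0, poll=0.5,
                        clock=time.monotonic, sleep=time.sleep):
    """Keep scrolling up until the message panel stops growing.

    Stops once the reported height has been unchanged for
    `stable_for` seconds, then returns the final height.
    """
    last_height = get_height()
    last_change = clock()
    while clock() - last_change < stable_for:
        scroll_up()
        sleep(poll)
        height = get_height()
        if height != last_height:
            last_height = height
            last_change = clock()
    return last_height
```

In a real scraper, `get_height` might call `driver.execute_script` to read the panel’s `scrollHeight`, and `scroll_up` would scroll the panel to the top; those details are WhatsApp-Web-specific and would need working out against the live DOM.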

Facebook would be more difficult and would require constant maintenance, as they work to prevent scraping, so the available scrapers tend to die a couple of times a year. It’d need some thought; data extraction would be a pain even once the pages have made it to disk.

Search history would be easy enough. Just dump all URLs from the browser’s history and filter them against the search URLs of the major search engines. Bonus points for language selection using the top-level domain.
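As a sketch of that filtering step (the engine-to-parameter table below is illustrative and far from complete; a real tool would cover many more engines and country domains):

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical mapping of search-engine hosts to their query parameter.
SEARCH_ENGINES = {
    "www.google.com": "q",
    "www.bing.com": "q",
    "duckduckgo.com": "q",
    "search.yahoo.com": "p",
}

def extract_queries(history_urls):
    """Pull search queries out of a list of browser-history URLs,
    ignoring any URL that isn't a known search-results page."""
    queries = []
    for url in history_urls:
        parts = urlparse(url)
        param = SEARCH_ENGINES.get(parts.netloc)
        if param:
            values = parse_qs(parts.query).get(param)
            if values:
                queries.append(values[0])
    return queries
```

`parse_qs` already handles percent- and plus-decoding, so the queries come out as plain text ready for cleanup.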

IRC would be easy. Have a bot that logs text to a file, with a JOIN event handler that messages new users asking them to opt in. Anyone who is identified with nickserv and responds with the trigger word gets added to the logging whitelist. Comments in a public channel are generally not private anyway, so no privacy issues there. Bonus points for reminding users that they’ve opted in each time they join, and giving them stats and the chance to opt out. The project would of course need permission from each channel where the bot operates, and we’d need to avoid technical channels and focus purely on chatty ones.
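The opt-in bookkeeping could be a small piece of pure logic, separate from the IRC plumbing. Something like this hypothetical sketch (class name, trigger words, and messages are all mine):

```python
class OptInRegistry:
    """Tracks which identified nicks have opted in to logging.

    A real bot would call on_join/on_message from its JOIN and
    PRIVMSG handlers and persist the whitelist between runs.
    """
    TRIGGER = "!optin"

    def __init__(self):
        self.whitelist = set()
        self.lines_logged = {}

    def on_join(self, nick):
        # Remind returning opted-in users; invite everyone else.
        if nick in self.whitelist:
            return f"{nick}: reminder, your messages here are collected (!optout to stop)"
        return f"{nick}: say {self.TRIGGER} to donate your messages to Common Voice"

    def on_message(self, nick, identified, text):
        """Return the text to log, or None if it shouldn't be logged."""
        if text.strip() == self.TRIGGER and identified:
            self.whitelist.add(nick)
            return None
        if text.strip() == "!optout":
            self.whitelist.discard(nick)
            return None
        if nick in self.whitelist:
            self.lines_logged[nick] = self.lines_logged.get(nick, 0) + 1
            return text
        return None
```

Requiring `identified` (nickserv-verified) before honouring the trigger stops someone opting in on somebody else’s nick.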

If people are interested then these are things I could probably write given a couple of days of effort.

It would also be interesting to think about scale. For an effort like this to be useful, we would need hundreds of thousands of sentences to be collected.

I know @irvin worked in the past with the chat idea and might have some learnings from the experience.

Do people chat the same way they speak? My experiences with IRC have been that a lot of people speak in an abridged way to limit the amount they have to type. (I always type full sentences but I seem to be in the minority.)

But what about some kind of chatbot app that you talk to with your voice? iOS has a built-in speech transcription API. You could record the user’s speech, transcribe on-device and then upload both speech and transcription for users to validate just as they do now.

I’ve been doing experiments training DeepSpeech on data that was transcribed via Google’s API and the results have been pretty good. It seems to almost always get words in frequent usage and only stumbles on obscure words or unusual names.

Yeah I think you’re right, people don’t really speak how they type. But I think IRC is much closer to speech than the carefully constructed prose that you find on Wikipedia or a discussion forum. I don’t think I’d even speak the phrase “carefully constructed prose” in an ordinary conversation, it’d be less compressed, and more off the cuff like “stuff that’s been thought about and worked on for a bit to make it sound smarter,” which I guess is the point

We collected some Chinese sentences from the log of a designated CC0 chat room on our local OSS community Slack, where everyone knows that all chat in the room will be de-identified and turned into public-domain material.

Which language are you trying to contribute? Perhaps you and your local community can do a similar thing.


@irvin, how many new sentences were you able to collect, and over what time frame?

I want to understand this in order to get a sense of the scale of the problem and figure out the most time-efficient strategies to keep growing our sentence dataset.

Thanks!

My understanding is that most STT products explicitly forbid you from using them to train other algorithms.


This is the log of the channel, which is not very active: about 2,000 sentences in 10 months. The manual cleaning took me about 4 hours.

Yes, it does need manual cleaning: de-identifying, removing parts that aren’t public-domain compatible (such as un-embedded content), and making sure each line is a speakable sentence.


Interesting. What do you think the outcome of this effort is compared with other ways to collect sentences manually, like the sentence collector events? (thinking about volume and time invested)

People just won’t keep submitting sentences to the collector over many months, but they chat every day. So we made it really easy to keep contributing daily, by telling them that “we will collect the sentences from this channel for the public benefit.”

It just works. Participants feel happy with it, and we found a channel through which to keep engaging them with Common Voice in their daily lives. (I share monthly stats about corpus collection and voice recording/validation to the chatroom.)

It’s just one of the places where I collect sentences. It’s good to have multiple diverse sources.


I like the idea of scraping sentences from Simple English Wikipedia. English is the only language that has two versions of Wikipedia, so we should take advantage of this if we haven’t done so yet. Simple English has around 150,000 articles.


I’ve had another idea. How about scraping Wiktionary’s usage examples? That would yield a bunch of sentences that are already agreed to be good examples of things people say.

https://en.wiktionary.org/wiki/Wiktionary:Example_sentences
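For what it’s worth, usage examples in Wiktionary’s wikitext usually sit on `#:` lines, often wrapped in a `{{ux|...}}` template, so a first-pass extractor could be as simple as this sketch (it only handles that one simple case, and real dumps would need much more cleanup):

```python
import re

# Matches wikitext example lines like "#: {{ux|en|Some sentence.}}".
# The template format varies across entries; this is a rough first pass.
UX_LINE = re.compile(r"^#+[:*]\s*\{\{ux\|en\|([^}|]+)")

def extract_examples(wikitext):
    """Collect the plain-text example sentences from one entry's wikitext."""
    examples = []
    for line in wikitext.splitlines():
        m = UX_LINE.match(line)
        if m:
            examples.append(m.group(1).strip())
    return examples
```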


Let’s put together a list of the sources we are sharing here, together with their licenses.

  • Ideally we need public domain (CC-0) content.
  • If the license is CC-BY (or another more permissive one), we’ll need to check with legal whether we can do a similar process as with wikipedia (analysis needs to be on a case-by-case basis)
  • If the license is different, I would avoid considering the source.

I’ve had another thought about the original topic of this thread.

Some sentences are more difficult to read than others, and some people have better reading comprehension than others. It ought to be possible to infer both of these by looking at how often a sentence is skipped, and how likely a user is to skip a sentence compared to other users.

Start by giving the user a low comprehension level and feed them sentences that match that level, plus some variance and unknowns. As their comprehension level increases, they get more difficult sentences.

This way, a new user with a poor reading level won’t be immediately put off, and their voice and accent will be trained on sentences that they’re more likely to utter.
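As a sketch of how that matching could work: skip counts give each sentence a difficulty score, and the picker serves sentences near the user’s level while occasionally exploring unknowns. All names and the smoothing choice here are mine, not anything Common Voice does today:

```python
import random

def skip_rate(skips, views):
    """Estimate sentence difficulty from how often it's skipped,
    with add-one smoothing so unseen sentences start mid-range."""
    return (skips + 1) / (views + 2)

def pick_sentence(sentences, user_level, rng=None, explore=0.1):
    """Serve a sentence whose difficulty roughly matches the user's level.

    `sentences` maps text -> (skips, views); `user_level` is the
    skip rate the user is estimated to be comfortable with. With
    probability `explore` we pick at random to probe unknowns.
    """
    rng = rng or random.Random()
    items = list(sentences.items())
    if rng.random() < explore:
        return rng.choice(items)[0]
    # Otherwise pick the sentence whose difficulty is closest to the level.
    return min(items, key=lambda kv: abs(skip_rate(*kv[1]) - user_level))[0]
```

The user’s own level could then be nudged up whenever they read several served sentences in a row without skipping, and down when they skip repeatedly.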

Re: licensing, from the above post, I guess the collector can set the terms on anything that’s explicitly opt-in, so IRC, WhatsApp, Facebook, Twitter or mailing list posts won’t be a problem. Wiktionary and simple Wikipedia are a different issue though.


I don’t see the problem in using un-common sentences when the corpus is used to train the acoustic model for DeepSpeech. Since the model doesn’t learn words but how letters sound, it will be a generic model anyway, independent of which words it learned from.

However, if the text was used for a language model it would be different.

Or am I completely off here?

Currently it’s only collecting the accents of people who have excellent reading comprehension. Anyone who actually uses STT because they are crap at reading isn’t having their voice collected.

This is a huge barrier to collecting the voices of children and the strong accents of the working class, they just won’t record sentences that they can’t read. No hillbillies, nobody from the ghetto, no underclass. Just a bunch of largely university educated, middle class people, with the only diversity provided by middle class ESL students.

So my complaint was that it’s not using common people at all. I don’t know how that will translate into recognition performance, but I doubt we’ll get feedback from those people anyway. Maybe it’ll be easier to get feedback from our children?