Common Voice Sentence Collection Tool launch

nukeador · January 28, 2019, 11:45am

Hello everyone,

I’m super excited to announce that after a few months of intense work, today we launch the Sentence Collection Tool site for all Common Voice contributors. We consider this a first beta version, but it is fully functional after some weeks of testing.

All sentences submitted, reviewed and validated using this tool will be incorporated into the main Common Voice site. Moving forward, we will point to this tool as the way to submit sentences to the project.

What is this tool?

This tool facilitates the task of submitting, reviewing and validating sentences in different locales so they can be incorporated into the main Common Voice site, where people can read them and donate their voice.

Why this tool?

The previous process to gather sentences was a bit unstructured, with too many places to go and unclear guidelines. In order for sentences to be useful for the Deep Speech algorithm, there are certain “hard requirements” this tool enforces to avoid problems in the future.

We also aim to keep improving the tool to make the experience even easier for everyone!

How can I start using it?

Just go to the Sentence Collection tool site and start submitting and reviewing sentences in your locales. Make sure you check the How-to page to understand how to use the tool.

Where do I report issues or ideas?

Our GitHub project page is the best place to report any issues with the site. If you want to discuss an idea or new feature with the rest of the community, you can do that in our Discourse.

How can I help with the development?

This tool is developed by Common Voice volunteers. Anyone can get involved in the development; you just need to know React or Kinto and chime in on our GitHub project to learn more.

If you are not technical, don’t worry! We usually open conversations on Discourse to give everyone the chance to influence the direction of the project.

Special thanks

I would like to extend special recognition and thanks to the people who made the launch of this tool possible.

@mhenretty for his idea and initial development.
@MKohler for taking the technical lead as a volunteer.
@gregor for his support from the voice web side.
Deep Speech team for their guidance on validation (@josh_meyer, @kdavis).
Kinto team for their support optimizing the code (@leplatrem).
Every volunteer who was involved in the QA testing phase during the last weeks (you were really fundamental):
- @ftyers
- @mozillakab
- @gtimoshaz
- @Txopi
- @tauheedul
- @irvin
- @danielsjf
- @whehd16
- @dcela
- @freaktechnik

Thank you everyone!

belkacem77 · January 28, 2019, 11:59am

Nice. We were waiting for that, and sorry that I couldn’t give the devs a hand.

txopi · January 29, 2019, 12:09am

Thank you everyone for the work done during the last weeks and months.
I made a final revision of the Basque sentences, which is available on this wiki, and uploaded all of them to the sentence collector with 0 errors. So there are 6,212 sentences ready in Basque.
Should I make a PR with the final sentences as you suggested three days ago, or review the sentences in the sentence collector? Or perhaps both?

nukeador · January 29, 2019, 12:46am

We are not accepting new PRs with sentences anymore; every new sentence should go through the sentence collector. We will be reviewing all PRs merged in the past months to ensure they don’t contain invalid words.

If this is about fixing already merged sentences, please coordinate with @gregor for the process.

Thanks!

belkacem77 · January 29, 2019, 8:00am

@nukeador Hi
And what can we do if there are errors among the old sentences pulled into GitHub that can’t be imported into the new sentence collector?

txopi · January 29, 2019, 8:59am

Good question.

In the case of Basque, where zero recordings have been made, if no PR will be accepted, the best option is to discard the sentence files and work only with the sentence collection tool. If you try to mix all the sentences, all the corrections I made over the last days (related to vocabulary, concordance, punctuation…) will generate duplicated sentences: the old ones plus the new ones in the same dataset.

For languages with recordings, I understand the difficulties, but that’s not the Basque case. It has zero recordings, and mixing the files’ sentences with the tool’s sentences will generate a problem. @gregor can you clean the Basque data please?

ruba.awayes · January 29, 2019, 10:19am

Hello Ruben,
Thank you to everyone who participated and made this amazing work possible.
I tried to log in and create an account using different alphanumeric usernames (not an email address), but unfortunately it always gave me “Login failed.”
So what should I do to resolve this?

nukeador · January 29, 2019, 11:33am

Have you tried another username? You might be using one that already exists. If possible, please open an issue about this problem.

gregor · January 29, 2019, 11:38am

Removing the sentence files/lines indeed stops them from being served. So for Basque I’d ask you to create a PR that removes what shouldn’t be there anymore. I’ll merge it promptly.

Thanks!

txopi · January 30, 2019, 1:19am

I created the PR that updates the Git sentences with the sentence collection tool’s sentences. That way, we’ll have all the correct and updated sentences in both places.

nukeador · February 7, 2019, 7:09pm

Hello,

I wanted to share a quick update. After just a week and a half, these are some of the stats:

Number of sentences collected: 192 899 (43K if we exclude the huge 150K Japanese corpus)
Number of languages: 20
Number of validated sentences: 11 183
Number of users registered: 173

This is just amazing, we didn’t expect this volume. Extraordinary work, everyone!

Note: We are actively exporting validated sentences to the main Common Voice app and they should appear as soon as the next deployment happens.

Cheers.

nukeador · February 25, 2019, 1:08pm

Week 4 update:

Number of sentences collected: 254K
Number of languages: 27
Number of validated sentences: 19K

Weekly increase: +4 locales, +34K sentences, +4K validated

nukeador · March 5, 2019, 12:36pm

Week 5 update:

Number of sentences collected: 331K (+77K)
Number of languages: 32 (+5)
Number of validated sentences: 27K (+8K)

nukeador · March 11, 2019, 4:00pm

Week 6 update:

Number of sentences collected: 361K (+30K)
Number of languages: 38 (+6)
Number of validated sentences: 33K (+6K)

nukeador · March 27, 2019, 1:13pm

Week 8 update:

Number of sentences collected: 395K (+34K)
Number of languages: 40 (+2)
Number of validated sentences: 59K (+26K)

nukeador · April 2, 2019, 9:20pm

3 posts were split to a new topic: Suggestions for the sentence collector
