Common Voice Sentence Collection Tool launch

Hello everyone,

I’m super excited to announce that after a few months of intense work, today we are launching the Sentence Collection Tool site for all Common Voice contributors. We consider this a first beta version, but it is fully functional after some weeks of testing.

All sentences submitted, reviewed and validated using this tool will be incorporated into the main Common Voice site. We will point to this tool as the way to submit sentences to the project moving forward.

What is this tool?

This tool facilitates submitting, reviewing and validating sentences in different locales so they can be incorporated into the main Common Voice site, where people can read them and donate their voice.

Why this tool?

The previous process to gather sentences was a bit unstructured, with too many places to go and unclear guidelines. In order for sentences to be useful for the Deep Speech algorithm, there are certain “hard requirements” that this tool enforces to avoid problems in the future.

We also aim to keep improving the tool to make the experience even easier for everyone!

How can I start using it?

Just go to the Sentence Collection tool site and start submitting and reviewing sentences in your locales. Make sure you check the How-to page to understand how to use the tool.

Where do I report issues or ideas?

Our GitHub project page is the best place to report any issues with the site. If you want to discuss an idea or new feature with the rest of the community, you can do that in our Discourse.

How can I help with the development?

This tool is developed by Common Voice volunteers. Anyone can get involved in the development; you just need to know React or Kinto and chime in on our GitHub project to learn more.

If you are not technical, don’t worry! We usually open conversations on Discourse to give everyone the chance to influence the direction of the project.

Special thanks

I would like to extend special recognition and a thank you to the people who made the launch of this tool possible.

  • @mhenretty for his idea and initial development
  • @MKohler for taking the technical lead as a volunteer
  • @Gweber for his support from the voice web side
  • Deep Speech team for their guidance on validation (@josh_meyer, @kdavis)
  • Kinto team for their support optimizing the code (@leplatrem)
  • Every volunteer who was involved in the QA testing phase during the last few weeks (you were really fundamental)

Thank you everyone!

6 Likes

Nice. We were waiting for that, and sorry that I couldn’t give the devs a hand.

2 Likes

Thank you everyone for the work done during the last weeks and months.
I made a final revision of the Basque sentences, which is available on this wiki, and uploaded all of them to the sentence collector with 0 errors. So there are 6,212 sentences ready in Basque.
Should I make a PR with the final sentences as you suggested three days ago, or review the sentences in the sentence collector? Or perhaps both?

1 Like

We are not accepting new PRs with sentences anymore; every new sentence should go through the sentence collector. We will be reviewing all PRs merged in the past months to ensure they don’t contain invalid words.

If this is about fixing already merged sentences, please coordinate with @gweber for the process.

Thanks!

@nukeador Hi
And what can we do if there are errors among the old sentences already on GitHub that can’t be imported into the new sentence collector?

Good question.

In the case of Basque, where zero recordings have been done, if no PR will be accepted, the best option is to discard the sentence files and work only with the sentence collection tool. If you try to mix all sentences, all the corrections I made over the last few days (related to vocabulary, concordance, punctuation…) will generate duplicated sentences: the old ones plus the new ones in the same dataset.

In the case of languages with recordings, I understand the difficulties, but that’s not the Basque case. It has zero recordings, and mixing the files’ sentences with the tool’s sentences will generate a problem. @gweber can you clean the Basque data please?

1 Like

Hello Ruben,
Thank you to everyone who participated and made this amazing work possible.
I tried to log in and create an account using different alphanumeric, non-email usernames, but unfortunately it always gave me “Login failed.”
So what should I do to resolve this?

Have you tried with another username? You might be using one that already exists. If possible, please open an issue about this problem.

Removing the sentence files/lines indeed stops them from being served. So for Basque I’d ask you to create a PR that removes what shouldn’t be there anymore. I’ll merge it promptly.

Thanks!

1 Like

I created the PR that updates the Git sentences with the sentence collection tool’s sentences. That way, we’ll have all the correct and updated sentences in both places.

1 Like

Hello,

I wanted to share a quick update. After just a week and a half, these are some of the stats:

  • Number of sentences collected: 192 899 (43K if we exclude the huge 150K Japanese corpus)
  • Number of languages: 20
  • Number of validated sentences: 11 183
  • Number of users registered: 173

This is just amazing, we didn’t expect this volume. Extraordinary work, everyone! :smiley:

Note: We are actively exporting validated sentences to the main Common Voice app and they should appear as soon as the next deployment happens.

Cheers.

4 Likes

Week 4 update:

  • Number of sentences collected: 254K
  • Number of languages: 27
  • Number of validated sentences: 19K

Weekly increase: +4 locales, +34K sentences, +4K validated

1 Like

Week 5 update:

  • Number of sentences collected: 331K (+77K)
  • Number of languages: 32 (+5)
  • Number of validated sentences: 27K (+8K)

Week 6 update:

  • Number of sentences collected: 361K (+30K)
  • Number of languages: 38 (+6)
  • Number of validated sentences: 33K (+6K)

Week 8 update:

  • Number of sentences collected: 395K (+34K)
  • Number of languages: 40 (+2)
  • Number of validated sentences: 59K (+26K)

2 Likes

3 posts were split to a new topic: Suggestions for the sentence collector