Sentence collection tool development topic

(Rubén Martín [away until April 24th]) #9

Thanks for the feedback. We have simply followed the guidelines provided by the Deep Speech team to make sure the resulting sentences and voices are useful for the algorithms. @josh_meyer might be able to provide a bit more detail on the reasoning behind them.



(Rubén Martín [away until April 24th]) #10

These two issues are tracking that:



Here are my first impressions:

  • The registration process is confusing: you first have to log in with your desired username and password in order to register. I think there should be an actual registration page, even if it functions largely the same as the login page, because it’s more logical and mirrors how other sites work.

  • The review process was also confusing to me. You tap Yes and it turns green. Does that mean it’s approved? But it’s still there when I return to the page. Oh, I have to physically click the Submit button, which was right by the page number box, so I thought it was for changing the page. Having it auto-submit, hide the sentence and show the next sentence would be a lot better.

  • What is the purpose of the Skip button? I can choose to ignore sentences without consequence and I can navigate to any page I like, so what advantage does skipping provide?

  • It seems pretty easy to add invalid sentences. I entered a period for the sentence and the source and it told me the sentence already exists (implying it would have allowed it if it didn’t). I then changed it to a comma in the sentence box and it went through and ended up as a blank sentence.
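
A minimal sketch of a check that would catch the punctuation-only case, assuming a sentence needs at least one letter to be readable aloud. The function name `has_readable_content` is hypothetical, and this is not the tool's actual validation code:

```python
def has_readable_content(sentence: str) -> bool:
    """Return True if the sentence contains at least one letter.

    Assumption: a sentence made only of punctuation and whitespace
    (for example "." or ",") should never be accepted.
    """
    return any(ch.isalpha() for ch in sentence)


# A lone period or comma is rejected; a normal sentence passes.
print(has_readable_content("."))
print(has_readable_content("Hello there."))
```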

Other suggestions:

  • I don’t think users should be allowed to review their own sentences.

  • Should it reject a sentence that doesn’t begin with a capital letter? That might help to catch copy and paste errors where there was an errant newline in the middle of the sentence.


(Rubén Martín [away until April 24th]) #12

We decided to stick with the current login system to avoid adding additional work for this version. In the next deployment we have added an explanation of how it works on the same page.

Good idea. For now we are going to stick with this workflow (click Yes on everything you want to approve, then submit), but we can look at further improvements in the future. I’ve opened this issue to track it.

The skip button is broken right now, we plan to remove it in the next deployment.

Can you describe this in more detail in a new issue? I don’t know if I fully understand what you got.

This is intentional, since we know a lot of people will be adding a lot of sentences from sources other than their own writing.

Not sure. This might limit people’s ability to take long paragraphs and split them into smaller sentences that can still be read and make sense.


(Michael Kohler) #13

oh, I’ve filed for this…

This might work in English, but then we would need to define this per language, as probably not all languages use uppercase?
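
One possible way around per-language rules, sketched here under the assumption that the check should only apply when the first letter belongs to a script that has upper and lower case. The function name is hypothetical and nothing here reflects what the tool actually does:

```python
def first_letter_is_acceptable(sentence: str) -> bool:
    """Check capitalisation only for scripts that have letter case.

    Assumption: leading punctuation such as quotes is skipped, and a
    sentence with no letters at all is not acceptable.
    """
    for ch in sentence:
        if ch.isalpha():
            # A letter is "cased" if upper- and lower-casing it differ.
            is_cased = ch.upper() != ch.lower()
            return (not is_cased) or ch.isupper()
    return False


print(first_letter_is_acceptable("hello world."))
print(first_letter_is_acceptable("你好世界"))
```

Scripts without case, such as Chinese, pass automatically, so no per-language table would be needed for this particular rule.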

I will fix a few more bugs now and then deploy again.


(Rubén Martín [away until April 24th]) split this topic #14

A post was split to a new topic: Problems finding public domain sentences


(Michael Kohler) #15

I’ve deployed the latest version with many bugfixes; however, the skip button will still be there. I need to think about its removal a bit more. It should, however, no longer mark the sentence as approved or rejected; it should now do nothing. This is tracked in


(Michael Kohler) #16

I’ve deployed a lot more fixes for both bugs and UX topics. Would be great if all of you can keep testing to make sure I didn’t introduce new bugs :slight_smile:

(Rubén Martín [away until April 24th]) #17

Thanks @txopi @DNGros @davidak @dabinat @jef.daniels and everyone who is helping with the QA phase, we have been able to fix a lot of issues and make the tool better thanks to your help.

Let’s do a final push in the next couple of days to make sure we are ready to move the tool to the beta phase. Let’s keep testing and reporting issues.

In the beta phase we will clean up the database and invite everyone in the Common Voice community who has been asking to get their sentences included to start using it as the main channel for sentence submission and review.

Sentences added and reviewed in the beta phase will start being incorporated into the main Common Voice site. We expect some languages to reach 5000 sentences, which will allow them to enable voice collection :smiley:

(Rubén Martín [away until April 24th]) split this topic #18

5 posts were split to a new topic: Feedback on how we collect and validate sentences


(Michael Kohler) #23

I have deployed a new version with several fixes. See the “CHANGELOG” column in . Thanks everyone for your feedback and reporting bugs, this is getting better and better :slight_smile:



Ok, great. Sorry for not checking for existing issues before commenting.



It would be great to hear from @josh_meyer or others involved with that team, especially on the issue of acronyms and on how the 14-word limit was chosen. On acronyms specifically, I worry that systems trained on Common Voice will not be able to handle the acronyms that occur all over everyday speech unless we allow them in the dataset.

Another area that still seems to need more clarity is whether non-A-Z characters should be allowed. I was reviewing some more of the sentences that are on the site now, and was not sure whether “Lucas played in the São Paulo soccer team” should be rejected or accepted.
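
To make the open questions concrete, here is a rough sketch that flags sentences for a reviewer rather than rejecting them. It assumes the 14-word limit means at most 14 whitespace-separated words, and it treats any run of two or more capital letters as a possible acronym; both are my assumptions, not the Deep Speech guidelines, and the function and flag names are made up for illustration:

```python
import re
from typing import List

MAX_WORDS = 14  # assumed reading of the word limit discussed in this thread


def review_flags(sentence: str) -> List[str]:
    """Return the reasons a reviewer might want to look twice at a sentence."""
    flags = []
    if len(sentence.split()) > MAX_WORDS:
        flags.append("too_long")
    # Two or more consecutive capitals as a whole word, e.g. "NASA".
    if re.search(r"\b[A-Z]{2,}\b", sentence):
        flags.append("acronym")
    # Letters outside plain ASCII, e.g. the "ã" in "São".
    if any(ch.isalpha() and not ch.isascii() for ch in sentence):
        flags.append("non_a_z_letter")
    return flags


print(review_flags("Lucas played in the São Paulo soccer team"))
```

Note that the acronym pattern would also flag any fully capitalised word, so it can only serve as a hint for the reviewer.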


(Pedro Lima) #26

I sent Portuguese sentences to the English dataset by accident. I’m not sure whether I picked the wrong option in the combobox or the site didn’t recognize Portuguese and sent those sentences to the English dataset. How do I remove those sentences?


(Rubén Martín [away until April 24th]) #27

Currently there is no way to remove sentences, but don’t worry, we’ll clean the database before moving to the beta phase.


(Michael Kohler) #28

I’ve just deployed the latest changes to the website. You can find the changelog here:

(Jumasheff) #29

Hey, do you have an endpoint where I can post all the scraped data I have? The texts are scraped from a news website, a founder of which has generously shared all of its contents under CC0 (via a Facebook chat conversation; a screenshot can be shared upon request). It’s boring to copy-paste all the news articles from 2011 to 2018, you know :slight_smile:
Or is it easier for you to accept all text data by email or other means? Perhaps a GitHub repo with all the texts will work for you?


(Rubén Martín [away until April 24th]) #30

We want to force all sentences to go through the sentence collector interface. Why? Because we want to ensure we run a few clean-up algorithms on them and they are properly reviewed by the rest of the community (quality is super important for the machine learning algorithm).

If the problem you are describing is being able to send a lot of sentences, we are improving the tool to allow you to copy and paste a massive amount of them in one single step. Please test the tool and let us know how we can improve the workflow (maybe in the future we can have an endpoint for adding sentences with your username) :slight_smile:
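
As an illustration only, a bulk paste could be prepared along these lines, assuming one sentence per line. The function name is hypothetical and this is not the collector's actual clean-up code:

```python
from typing import List


def prepare_bulk_submission(pasted_text: str) -> List[str]:
    """Split pasted text into unique, whitespace-normalised sentences.

    Assumption: each line of the pasted text holds one sentence.
    Blank lines are dropped and exact duplicates are kept only once,
    preserving the order in which sentences first appear.
    """
    seen = set()
    sentences = []
    for line in pasted_text.splitlines():
        sentence = " ".join(line.split())  # collapse runs of whitespace
        if sentence and sentence not in seen:
            seen.add(sentence)
            sentences.append(sentence)
    return sentences


pasted = "First one.\n\n  Second   one. \nFirst one.\n"
print(prepare_bulk_submission(pasted))
```

The sentences returned this way would still go through the normal review step.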

(Rubén Martín [away until April 24th]) #31

Update (January 14th)

Again, thanks so much to everyone testing the tool and reporting issues. We have tremendously improved the quality of the tool in the past week and a half :partying_face:

Currently we don’t have any outstanding bugs, so if that remains the case we plan to clean up the testing database and move to the beta phase by the end of this week.

That means we still have a few more days to jump into the tool and make sure everything is working as expected.

We will also evaluate all feature requests and define priorities for the upcoming releases, and we would love to get the community involved in this. @mkohler and I will design a process that gives people here the chance to influence what’s coming next.




Does beta mean the sentences will be submitted to the real site or will they still be in a test database that gets wiped afterwards?