Sentence collection tool development topic

#25

It would be great to hear from @josh_meyer or others involved with that team, especially on the issue of acronyms and on how the 14-word limit was chosen. On acronyms specifically, I worry that systems trained on Common Voice will not be able to handle the acronyms that occur all over everyday speech unless we allow them in the dataset.

Another area that still seems to need more clarity is whether non-A-Z characters should be allowed. I was reviewing some more of the sentences that are on the site now, and was not sure whether “Lucas played in the São Paulo soccer team” should be rejected or accepted.
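
To make the question concrete, here is a minimal sketch of what a strict check might look like. Only the 14-word limit comes from this thread; the regular expressions, the function name `check_sentence`, and the problem labels are my own assumptions for illustration, not the sentence collector's actual rules.

```python
import re
from typing import List

MAX_WORDS = 14  # the word limit discussed in this thread

# Hypothetical patterns, for illustration only (not the tool's real checks).
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")             # two or more consecutive capitals
OUTSIDE_A_Z = re.compile(r"[^A-Za-z\s.,'?!-]")     # anything but A-Z and basic punctuation


def check_sentence(sentence: str) -> List[str]:
    """Return the list of problems a strict validator might report."""
    problems = []
    if len(sentence.split()) > MAX_WORDS:
        problems.append("too many words")
    if ACRONYM.search(sentence):
        problems.append("contains acronym")
    if OUTSIDE_A_Z.search(sentence):
        problems.append("contains non-A-Z character")
    return problems


print(check_sentence("Lucas played in the São Paulo soccer team"))
print(check_sentence("The BBC reported it"))
```

Under these assumed rules the São Paulo sentence is flagged only because of the “ã”, which is exactly the case I am unsure about.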


(Pedro Lima) #26

I sent Portuguese sentences to the English dataset by accident. I'm not sure whether I picked the wrong option in the combobox or the site didn’t recognize Portuguese and sent those sentences to the English dataset. How do I remove those sentences?


(Rubén Martín) #27

Currently there is no way to remove sentences, but don’t worry, we’ll clean the database before moving to the beta phase.


(Michael Kohler) #28

I’ve just deployed the latest changes to the website. You can find the changelog here: https://github.com/Common-Voice/sentence-collector/releases


(Jumasheff) #29

Hey, do you have an endpoint where I can post all the scraped data I have? The texts are scraped from a news website (ky.kloop.asia), one of whose founders has generously shared all of its contents under CC0 (via a Facebook chat conversation; a screenshot can be shared upon request). It’s boring to copy-paste all the news articles from 2011 to 2018, you know :slight_smile:
Or is it easier for you to accept all text data by email or other means? Perhaps a GitHub repo with all the texts will work for you?


(Rubén Martín) #30

We want to force all sentences to go through the sentence collector interface. Why? Because we want to ensure we run a few clean-up algorithms on them and they are properly reviewed by the rest of the community (quality is super important for the machine learning algorithm).

If the problem you are describing is being able to send a lot of sentences, we are improving the tool to allow you to copy and paste a massive number of them in one single step. Please test the tool and let us know how we can improve the workflow (maybe in the future we can have an endpoint for adding sentences with your username) :slight_smile:


(Rubén Martín) #31

Update (January 14th)

Again, thanks so much to everyone testing the tool and reporting issues. We have tremendously improved the quality of the tool in the past week and a half :partying_face:

Currently we don’t have any outstanding bugs, so if that remains the same we plan to clean up the testing database and move to the beta phase by the end of this week.

That means we still have a few more days to jump into the tool and make sure everything is working as expected.

We will also evaluate all feature requests and define priorities for the upcoming releases, and we would love to get the community involved in this. @mkohler and I will design a process that gives people over here the chance to influence what’s coming next.

Thanks!


#32

Does beta mean the sentences will be submitted to the real site or will they still be in a test database that gets wiped afterwards?


(Rubén Martín) #33

Starting in the beta phase, sentences will be passed to the real site once they get enough positive reviews (currently 2/3, the same threshold we use for voice reviews).
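
As a rough illustration of that threshold, the sketch below assumes “2/3” means a sentence is decided as soon as two of at most three reviewers agree. That reading, the constant `VOTES_NEEDED`, and the function `review_outcome` are assumptions made for this example, not the tool's actual code.

```python
from typing import List, Optional

VOTES_NEEDED = 2  # assumption: "2/3" means two matching votes out of at most three


def review_outcome(votes: List[bool]) -> Optional[bool]:
    """Return True (approved), False (rejected), or None (needs more reviews)."""
    approvals = sum(votes)              # True counts as 1, False as 0
    rejections = len(votes) - approvals
    if approvals >= VOTES_NEEDED:
        return True
    if rejections >= VOTES_NEEDED:
        return False
    return None


print(review_outcome([True]))               # one approval so far
print(review_outcome([True, False, True]))  # two approvals out of three
```

With this rule a third review is only needed when the first two reviewers disagree.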


(Rubén Martín) #34

I want to let everyone participating in this topic know that we have opened a new topic to decide together what the tool's next priorities are going to be. Please chime in, we want to hear from you!