Sentence collection tool development topic

DNGros · January 8, 2019, 11:06pm

It would be great to hear from @josh_meyer or others involved with that team especially on the issue around acronyms as well as around how the 14 word limit was chosen. Specifically on acronyms I worry whether systems trained on Common Voice will be able to handle the acronyms that occur all over everyday speech, unless we allow them in the dataset.

Another area that still seems to need more clarity is whether non-A-Z characters should be allowed. I was going through and reviewing some more sentences that are on the site now, and was not sure whether “Lucas played in the São Paulo soccer team” should be rejected or accepted.
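
To make the open questions concrete, here is a minimal sketch of how the three rules discussed above (word limit, A-Z characters, acronyms) could be checked. It is not the tool's actual validation code. The function name `check_sentence`, the allowed punctuation set, and the "two or more capital letters" acronym heuristic are all assumptions for illustration.

```python
import re

MAX_WORDS = 14  # the word limit mentioned in this thread

# Assumption: only ASCII letters, spaces, and a few punctuation marks are accepted.
ALLOWED = re.compile(r"^[A-Za-z ,.'?!-]+$")

# Assumption: a standalone run of two or more capital letters counts as an acronym.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def check_sentence(sentence):
    """Return a list of reasons to reject the sentence (empty list means accept)."""
    problems = []
    if len(sentence.split()) > MAX_WORDS:
        problems.append("too many words")
    if not ALLOWED.match(sentence):
        problems.append("non-A-Z characters")
    if ACRONYM.search(sentence):
        problems.append("acronym")
    return problems


print(check_sentence("Lucas played in the São Paulo soccer team"))
print(check_sentence("The NASA launch was delayed."))
```

Under these assumed rules the São Paulo sentence is flagged only for the "ã", which is exactly the case where a clear policy is needed.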

Codigo_Logo_Programacao_e_Inteligencia_Artificial · January 9, 2019, 6:46pm

I sent Portuguese sentences to the English dataset by accident. I’m not sure whether I picked the wrong option in the combobox or the site didn’t recognize Portuguese and sent those sentences to the English dataset. How do I remove those sentences?

nukeador · January 9, 2019, 9:16pm

Currently there is no way to remove sentences, but don’t worry, we’ll clean the database before moving to the beta phase.

mkohler · January 13, 2019, 8:08pm

I’ve just deployed the latest changes to the website. You can find the changelog here: https://github.com/Common-Voice/sentence-collector/releases

jumasheff · January 14, 2019, 10:13am

Hey, do you have an endpoint where I can post all the scraped data I have? The texts are scraped from a news website (ky.kloop.asia), a founder of which has generously shared all of its contents under CC0 (via a Facebook chat conversation; a screenshot can be shared upon request). It’s boring to copy-paste all the news articles from 2011 to 2018, you know.
Or is it easier for you to accept all text data by email or other means? Perhaps a github repo with all the texts will work for you?

nukeador · January 14, 2019, 1:04pm

We want to force all sentences to go through the sentence collector interface. Why? Because we want to ensure we run a few clean-up algorithms on them and they are properly reviewed by the rest of the community (quality is super important for the machine learning algorithm).

If the problem you are describing is being able to send a lot of sentences, we are improving the tool to allow you to copy and paste a massive amount of them in one single step. Please test the tool and let us know how we can improve the workflow (maybe in the future we can have an endpoint for adding sentences with your username).
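
For anyone preparing a large batch before pasting it in, a rough sketch of the kind of pre-cleaning that might help is below. It assumes one sentence per line and only collapses whitespace and drops blank lines and exact duplicates. The function name `prepare_bulk_submission` is hypothetical, and these steps are not necessarily the clean-up the tool itself runs.

```python
def prepare_bulk_submission(pasted_text):
    """Split pasted text into unique, whitespace-normalized sentences.

    Assumption: each line of the pasted text holds exactly one sentence.
    """
    seen = set()
    sentences = []
    for line in pasted_text.splitlines():
        cleaned = " ".join(line.split())  # collapse runs of whitespace, trim ends
        if cleaned and cleaned not in seen:  # skip blank lines and exact duplicates
            seen.add(cleaned)
            sentences.append(cleaned)
    return sentences


batch = prepare_bulk_submission("Hello  world.\n\nHello world.\n  Second one. ")
print(batch)
```

The sentences would still need to go through the normal review in the tool afterwards.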

nukeador · January 14, 2019, 5:34pm

Update (January 14th)

Again, thanks so much to everyone testing the tool and reporting issues. We have tremendously improved the quality of the tool in the past week and a half.

Currently we don’t have any outstanding bugs, so if that remains the same we plan to clean up the testing database and move to the beta phase by the end of this week.

That means we still have a few more days to jump into the tool and make sure everything is working as expected.

We will also evaluate all feature requests and define priorities for the upcoming releases. We would love to get the community involved in this, so @mkohler and I will design a process to give people here the chance to influence what’s coming next.

Thanks!

dabinat · January 14, 2019, 9:29pm

Does beta mean the sentences will be submitted to the real site or will they still be in a test database that gets wiped afterwards?

nukeador · January 14, 2019, 10:36pm

Sentences starting in the beta phase will be passed to the real site once they get enough positive reviews (currently 2/3, same as we are using for voice reviews).
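
As a rough illustration of the 2/3 rule, here is a minimal sketch that reads it as "two positive votes out of at most three reviews". That reading, the `review_status` name, and the "rejected" and "pending" states are assumptions for illustration, not a description of the tool's code.

```python
REQUIRED_APPROVALS = 2  # "2/3" from the post, read as 2 positive votes...
MAX_REVIEWS = 3         # ...out of at most 3 reviews (assumed interpretation)


def review_status(votes):
    """votes is a list of booleans, True meaning the reviewer approved.

    Returns "approved", "rejected", or "pending".
    """
    approvals = sum(1 for vote in votes if vote)
    rejections = len(votes) - approvals
    if approvals >= REQUIRED_APPROVALS:
        return "approved"
    # Once too many reviewers have said no, the threshold can no longer be reached.
    if rejections > MAX_REVIEWS - REQUIRED_APPROVALS:
        return "rejected"
    return "pending"


print(review_status([True, True]))
print(review_status([True, False]))
print(review_status([False, False]))
```

With this reading, a sentence with one approval and one rejection stays pending until a third review arrives.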

nukeador · January 16, 2019, 2:15pm

I want to let the people participating in this topic know that we have already opened a topic to decide together what the tool’s next priorities are going to be. Please chime in, we want to hear from you!

nukeador · January 22, 2019, 4:06pm

Today @mkohler and I met to talk about the tool development.

Meeting notes (January 22nd)

Align on the list of locales

We want to make sure the locales and their codes are aligned with voice web and pontoon.

We will coordinate with @gregor and @pmo about new languages.
We feel that the request for a new locale should come from the Common Voice site, and then we coordinate internally to replicate the same names and codes.

Testing the export process

Before moving to beta we really need to test that everything is working as expected when exporting validated sentences.

- @mkohler will test the export (and measure export times)
  - Based on that we will make a decision on how often to export data to voice web.
  - We need to get to an agreement on file format and naming with @gregor
- Once @mkohler has admin access to the kinto db
  - Provide Ops team with execution commands to run.
- Clean up the DB and create pending locales
  - We will need to test the site to ensure we are ready for beta.
- Comms about beta to community will be created and published in all channels.

February Roadmap

We talked about the Sentence collection tool - February 2019 milestone discussion and agreed on priorities (I’ll update the post tomorrow)

nukeador · January 25, 2019, 9:33pm

Hi everyone,

Thanks for your great efforts helping to identify and report issues and ideas for the tool, we have advanced a lot!

We are now ready to move to the next phase. In order to launch the functional beta version next week, we will clean up the existing database and remove all testing sentences this weekend. Please hold off on testing for now.

As soon as we launch the beta version we will be doing extensive comms to the whole Common Voice community and start using the tool as the way to submit, review and validate sentences. Sentences validated starting in the beta phase will be incorporated into the main voice-web site!

Stay tuned for the exciting beta launch announcement!

mkohler · January 26, 2019, 1:48pm

Export of approved sentences: https://github.com/MichaelKohler/voice-web/tree/sc_export_before_flush/server/data

Export of all sentences: https://github.com/MichaelKohler/voice-web/tree/sc_export_all_before_flush/server/data/

Please note:

- Some sentences were submitted before we added the validation.
- Being approved doesn’t mean that a sentence is OK; it may also just have been part of testing.

If that looks good, I will run the flush script tomorrow/Monday.