Sentence collection tool development topic

nukeador · February 5, 2019, 7:58pm

Continuing the discussion from We want your feedback: Improving the sentence collection:

Hi all,

This topic is aimed only at developers who would like to help (React and Kinto skills required).

What do we need?

Fork the project and test that you can run the environment locally following the instructions.
Is everything working as expected? If not, submit a new issue.
Review the pending issues on the next milestone.
Create a new PR to fix any of the existing issues in the most recent milestone.

Please add any questions here.

Thanks!

mkohler · November 14, 2018, 10:46pm

I have the tool running locally now. I will create a PR tomorrow with one fix and documentation updates so others can run it as well.

nukeador · December 17, 2018, 5:14pm

Quick update: We still have some pending issues for the MVP. Ideally we would have a beta version to test before the end of the year; if the QA of that beta is satisfactory, we can start using it to collect and review sentences.

Thanks for your patience

nukeador · December 20, 2018, 12:40pm

Hello everyone,

Today @mkohler and I held a project meeting and I wanted to share the notes here for your input/feedback.

Meeting notes - December 20th

Where are we?

We have closed most pending issues for the first milestone; only two remain:

Implement some general validations based on Deep Speech team recommendations.
Document the export process of the approved sentences.

Next steps

During the following weeks we want to implement the requirements and get a small group of people to play with the tool (quality testing) to make sure it doesn’t break.

Once we have confidence in the tool, we want to get a bigger group (probably the people who have been asking for a long time to get their sentences included) to use the beta version.

2019 planning

For next year there are a few things we want to do:

Get everyone using the tool and turn it into the way to contribute sentences to the project.
Get the most active people using the tool to help with ideation on what we want to improve about it (ownership of the tool’s roadmap).
Build a coding group to help @mkohler with code changes following the defined roadmap. We want the community to have tech ownership too.

We would like your comments and feedback about this:

Does this direction make sense to you?
What are we missing?

Thanks!

nukeador · January 2, 2019, 11:35am

Happy new year everyone,

We have just finished the main issues for the sentence collector so now we need to run a QA (quality assurance) phase to make sure the tool is working as expected.

Note: Like this message if you are testing the site (so we know how many people got involved). Testing with non-Latin languages would be appreciated.

How to help with the QA

Access the testing site.
Try the following features: Login, Add sentences, Review sentences, Profile, How to
Play around with the site. Is everything working as expected? Add some bad sentences so you can test the reject feature during review.
If you find a bug, please report it on GitHub.

We will be running this QA for at least one week (until Jan 9th) and then we will evaluate whether we need more QA or can move to a beta phase.

Thanks so much!

DNGros · January 3, 2019, 12:07am

I’m excited to see this up and running! Common Voice is shaping into a really powerful dataset, but I’ve worried for a while whether there was enough sentence diversity in the dataset. Thanks for your work on this tool.

Below are listed some things where I had to second guess myself when reviewing or writing. These are areas where there might be need for better reviewer/submitter guidelines or candidates for a FAQ section for the guidelines.

Autovalidation specific issues

Should acronyms really be ruled invalid?
The current guidelines and autovalidator don’t allow abbreviations and acronyms because they can be pronounced many ways. However, in my opinion the fact that they can be pronounced many ways makes them valuable to a dataset, not harmful. An STT/ASR system must be able to handle all the many ways users pronounce words and have a rich enough language model to capture acronyms correctly. Pronounced acronyms like “FIFA” or “LIDAR”, as well as “spelled out” acronyms like “CO2”, “USB”, “FC Barcelona”, “FAQ” are common parts of language and potential user queries. I think they should be included in the dataset.
Longer max length
The current max length is set at 14 words. This is fairly short, and I found several of the sentences I wrote being rejected. I would say in general we shouldn’t disallow longer sentences, as it is important for practical STT/ASR systems to be able to handle longer sentences (the one you are finishing reading is 24 words). More data-driven analysis could be done on what the typical length of spoken sentences is, but I would vote 30 words is a sensible max.
Number words
Certain words include digits. It seems these should not be spelled out, and digits should not be auto-rejected.
Ex: “San Francisco 49ers”, “H2O”, “5G wireless network”.
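
A minimal sketch of the three autovalidation rules discussed above (word limit, digits, acronyms), not the tool's actual implementation. Only the 14-word limit comes from the post; the function name, the acronym heuristic, and the reason strings are assumptions made for illustration.

```python
import re
from typing import List

MAX_WORDS = 14  # limit quoted in the post above


def rejection_reasons(sentence: str, max_words: int = MAX_WORDS) -> List[str]:
    """Return human-readable reasons a sentence would be auto-rejected."""
    reasons = []
    words = sentence.split()
    if len(words) > max_words:
        reasons.append(f"too long: {len(words)} words (max {max_words})")
    if re.search(r"\d", sentence):
        reasons.append("contains digits")
    # Assumed heuristic: a standalone run of two or more capitals is an acronym.
    if re.search(r"\b[A-Z]{2,}\b", sentence):
        reasons.append("contains an acronym or abbreviation")
    return reasons


print(rejection_reasons("I bought a USB cable."))
print(rejection_reasons("5G wireless network"))
```

Returning a list of reasons rather than a single pass/fail flag would make it easy to relax one rule (for example, raising `max_words` to 30) without touching the others.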

Usability features

Inform user why sentences were autorejected (length, digits, abbreviation, etc).
Bug: Hitting “submit” on the add sentences form sometimes does nothing. For example, when I tried to just copy and paste the 438 sentences I wrote here https://pastebin.com/C0YpxtSJ it would not accept them and failed without any user feedback. Certain subsets of the paste are able to progress to the next screen. I’m not able right now to investigate further what is happening, but it is some bug. If I figure that issue out I can open an issue/PR on GitHub.

Potential areas for improved guidelines

Is incorrect capitalization incorrect?
Ex: “I’ve heard that this hainanese chicken rice tastes amazing!”
“hainanese” should be capitalized.
However, this should not affect pronunciation and I would vote that the guidelines rule this valid.
Should misplaced commas be invalid?
Ex: “If we’re in London, and trying to download a cat picture from a website, based in Australia, fetching the data results in a lot of hops along a lot of networks.”
The comma after “website” is unnecessary and implies an unnatural pronunciation.
Ex: “We also emit CO2 when we burn fossil fuels to keep places we work well-lit and depending where you are in the world warm or cool enough to work in.”
This really should have some commas to be “grammatically correct”. However, it is still pronounceable, and thus could be accepted if that is what the guidelines encourage.
Clarify that dates are numbers which should be spelled out.
Ex: “In 2017 we saw an increase in growth.”
This might not be clear from the current guidelines.
Hyphen in numbers or not? Or are both valid?
“nineteen eighty-eight” or “nineteen eighty eight”
Em dashes
Ex: “When we look at this map, it gives us an idea of how clean the emissions are likely to be, on average for a country - for example we can see that in France, with its heavy reliance on nuclear energy, generating a kilowatt hour of electricity than Poland, which is heavily invested in coal.”
This sentence would be a good challenge to a system and is grammatically valid (other than the fact that it uses a hyphen rather than an em dash). However, a panel of human transcribers given only the audio probably would not agree on the transcription, as both a period and a dash sound similar. So it seems like this should be invalid. Agreed?
Should foreign characters common in a language be allowed?
For example: letters with diacritics sometimes appear in English when using words (especially foreign proper nouns) like “Nestlé”, “L’Oréal”, “exposé”, “José”, and “Sofía”. Should these be allowed? Currently they would get autorejected.
Colloquial grammar / “as they said it” spelling
For example: “I ain’t gonna put up with this”, “why ya’ gotta be like that”, “That’s one mean lookin’ motorcycle man.”
Is this allowed?
This could be valuable for a system to recognize, as it might better reflect how some real users talk, and a system should not fail when the user says “gonna” instead of “going to”. It isn’t “grammatically correct”, but this kind of writing is sometimes used in fiction to better communicate a character’s personality.
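
On the diacritics question above, here is one possible character check that accepts any Unicode letter instead of only A to Z. This is a sketch under my own assumptions: the function name and the punctuation allowlist are hypothetical, and it says nothing about how the current autovalidator works.

```python
import unicodedata

# Assumed allowlist of non-letter characters; a real tool would tune this per language.
ALLOWED_PUNCTUATION = set(" .,?!'’\"“”-")


def disallowed_characters(sentence):
    """Return the characters that are neither letters nor allowed punctuation."""
    bad = []
    # NFC folds "e" + combining accent into a single letter such as "é".
    for ch in unicodedata.normalize("NFC", sentence):
        if ch in ALLOWED_PUNCTUATION:
            continue
        # General categories starting with "L" cover letters, accented ones included.
        if unicodedata.category(ch).startswith("L"):
            continue
        if ch not in bad:
            bad.append(ch)
    return bad


print(disallowed_characters("Nestlé and José met Sofía."))
print(disallowed_characters("H2O costs $5"))
```

Note that this accepts letters from every script, so an English sentence containing Cyrillic would also pass. A per-language alphabet would be stricter, at the cost of maintaining one for each language.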

All of these might not need specific guidelines, and for some we could just let the reviewers vote. However, I’ll add them here in case others also had the same questions and we want to try and standardize things more.

davidak · January 3, 2019, 7:59am

A general thing I noticed:

Why is this tool not part of Common Voice, and why doesn’t it follow its design?

More rules might be needed. For example, don’t add offensive sentences that violate the Code of Conduct. Also a hint on what good sentences are, for example general facts.

Also a hint on where public domain resources can be found.

nukeador · January 3, 2019, 11:52am

This is just a quick tool to solve the needs of a small group of people. To have it available sooner, we decided not to create a super polished version but rather a functional version that can be improved over time. Ideally we want an integrated sentence collector in the main portal during 2019.

Just filed:

nukeador · January 3, 2019, 11:30am

Thanks for the feedback. We have just followed the guidelines provided by the Deep Speech team in order to make sure the resulting sentences+voices are useful for the algorithms. @josh_meyer might be able to provide you a bit more detail on the reasoning behind them.

Cheers.

nukeador · January 3, 2019, 11:46am

These two issues are tracking that:

dabinat · January 3, 2019, 4:14pm

Here are my first impressions:

The registration process is confusing - you have to first log in with your desired username and password in order to register. I think there should be an actual registration page, even if it functions largely the same as the login page, because it’s more logical and mirrors how other sites work.
The review process was also confusing to me. You tap Yes and it turns green. Does that mean it’s approved? But it’s still there when I return to the page. Oh, I have to physically click the Submit button, which was right by the page number box, so I thought it was for changing the page. Having it auto-submit, hide the sentence and show the next sentence would be a lot better.
What is the purpose of the Skip button? I can choose to ignore sentences without consequence and I can navigate to any page I like, so what advantage does skipping provide?
It seems pretty easy to add invalid sentences. I entered a period for the sentence and the source and it told me the sentence already exists (implying it would have allowed it if it didn’t). I then changed it to a comma in the sentence box and it went through and ended up as a blank sentence.
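
One way to guard against the blank-sentence case described above is to require at least one letter before the duplicate check runs. This is a hedged sketch, not the tool's code: `filter_submission` and the reason strings are made-up names, and the real backend may store and compare sentences differently.

```python
def filter_submission(lines, existing):
    """Split submitted lines into accepted sentences and (line, reason) rejections."""
    accepted, rejected = [], []
    seen = set(existing)
    for raw in lines:
        sentence = raw.strip()
        # A lone "." or "," has nothing to read aloud, so reject it first.
        if not any(ch.isalpha() for ch in sentence):
            rejected.append((raw, "no letters"))
        elif sentence in seen:
            rejected.append((raw, "duplicate"))
        else:
            seen.add(sentence)
            accepted.append(sentence)
    return accepted, rejected


accepted, rejected = filter_submission([".", ",", "Hello there."], set())
print(accepted, rejected)
```

Checking for content before checking for duplicates means a lone period is reported as having no letters instead of as "already exists".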

Other suggestions:

I don’t think users should be allowed to review their own sentences.
Should it reject a sentence that doesn’t begin with a capital letter? That might help to catch copy and paste errors where there was an errant newline in the middle of the sentence.

nukeador · January 3, 2019, 5:23pm

We decided to stick with the current login system to avoid adding additional work for this version. In the next deployment we have added some explanation of how it works on the same page.

Good idea. For now we are going to stick with this workflow (click yes on everything you want to approve and then finally submit), but we can see what other improvements we can make in the future. I’ve opened this issue to track it.

The skip button is broken right now, we plan to remove it in the next deployment.

Can you describe this in more detail in a new issue? I don’t know if I fully understand what you got.

This is intentional since we know a lot of people will be adding a lot of sentences from different sources other than their own creation.

Not sure, this might limit the ability for people to take sentences from long paragraphs and split them into smaller ones that still read well and make sense.

mkohler · January 3, 2019, 6:55pm

oh, I’ve filed https://github.com/Common-Voice/sentence-collector/issues/58 for this…

This might work in English, but then we would need to define this per language, as probably not all languages use uppercase?
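
To illustrate the per-language concern, a capital-letter check could be switched on only for locales that opt in. The table below is purely illustrative (the locale codes and their settings are my assumptions, not project configuration), and `starts_correctly` is a hypothetical name.

```python
# Illustrative table: which locales expect a capital first letter.
REQUIRES_CAPITAL_START = {"en": True, "de": True, "ka": False, "zh-CN": False}


def starts_correctly(sentence, locale):
    """Return False only when the locale opts in and the first letter is not uppercase."""
    if not REQUIRES_CAPITAL_START.get(locale, False):
        return True  # rule disabled or locale unknown: do not reject
    # Skip leading quotes or spaces and look at the first actual letter.
    first_letter = next((ch for ch in sentence if ch.isalpha()), None)
    return first_letter is not None and first_letter.isupper()


print(starts_correctly("and then he left.", "en"))
print(starts_correctly("你好。", "zh-CN"))
```

Defaulting unknown locales to "no check" errs on the side of accepting sentences, which leaves the decision to human reviewers.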

I will fix a few more bugs now and then deploy again.

nukeador · January 4, 2019, 10:56am

A post was split to a new topic: Problems finding public domain sentences

mkohler · January 3, 2019, 8:09pm

I’ve deployed the latest version with many bugfixes, however the skip button will still be there. I will need to think about that removal a bit more. It should, however, no longer mark the sentence as approved/rejected; it now does nothing. This is tracked in https://github.com/Common-Voice/sentence-collector/issues/44

mkohler · January 6, 2019, 10:51pm

I’ve deployed a lot more fixes for both bugs and UX topics. It would be great if all of you could keep testing to make sure I didn’t introduce new bugs.

nukeador · January 8, 2019, 12:19pm

Thanks @txopi @DNGros @davidak @dabinat @jef.daniels and everyone who is helping with the QA phase, we have been able to fix a lot of issues and make the tool better thanks to your help.

Let’s do a final push in the next couple of days to make sure we are ready to move the tool to beta phase, let’s keep testing and reporting issues.

In the beta phase we will clean up the database and invite everyone in the Common Voice community who has been asking to get their sentences included to start using it as the main channel for sentence submission and review.

Sentences added and reviewed in the beta phase will start being incorporated in the main Common Voice site. We expect some languages to reach the 5000 sentences, which will allow them to enable voice collection.

nukeador · January 9, 2019, 11:11am

5 posts were split to a new topic: Feedback on how we collect and validate sentences

mkohler · January 8, 2019, 8:22pm

I have deployed a new version with several fixes. See the “CHANGELOG” column in https://github.com/Common-Voice/sentence-collector/projects/1 . Thanks everyone for your feedback and for reporting bugs, this is getting better and better.

DNGros · January 8, 2019, 10:42pm

Ok, great. Sorry for not checking for existing issues before commenting.