[ACTION REQUIRED] New Sentence Collector Infrastructure and Improvements

mkohler · October 7, 2020, 2:49pm

Hi everyone

We’re happy to announce a new version and a new address for the Sentence Collector. When you head over to Common Voice, you can now log in to the Sentence Collector with the same credentials you use on the Common Voice website! The old address will forward you to the new home, so existing links in documentation and Discourse posts will continue to work.

Action Required - Migrate your account

If you previously had an account for the Sentence Collector, you can migrate it to the new website after logging in. Click on “Migrate Account” in the sidebar and enter your previous credentials. This ensures that you don’t need to configure your languages again and that you keep your statistics.

Please migrate your account by end of day on November 7th (no matter your time zone), as we will shut down the old instance after that.

Reach out to me or write in this Discourse post if you have any issues with the migration.

User-facing changes and improvements

- You can now log in with Auth0, the same process as on the Common Voice website!
- Improved overall performance
- More granular statistics on their own page, updated every 6 hours
- Statistics now show the real values and are no longer cut off at 10k
- More robust sentence submission and better duplicate checks
- Native and English language names are both shown in the profile and on the front page
- The submission ID is now saved, so in the future we can identify sentences that belong to the same submission
- Sentence lists can be downloaded through the API for further text analysis
- Fixed the “Swiping Review Tool” setting so it works on the first click
- The source is shown on review cards

Technical changes

- MySQL backend
- Faster export logic
- New API endpoints, see README in the GitHub repo
- General cleanup of issues in the GitHub repo
- Refactoring of React components to use functions and hooks
- Tests for backend code
- Initial infrastructure for frontend tests
- Dependency updates
- Staging server

If you find any bugs, please report them on GitHub.

Happy to answer any questions you might have.
Michael

sinumade · October 10, 2020, 8:44pm

Japanese version: About the new Collector tool (updated October 7, 2020)

Can I ask a question in this topic?

Are these by design?
- I have migrated my account, but the Migrate Account menu is still showing in the sidebar.
- When I click on the menu in the sidebar to open the Review page, nothing appears. When I refresh the page, a sentence or message appears (after that, it works fine).
- I can't be logged in to the platform and the Collector tool at the same time. In my environment, when I log in to one, I am logged out of the other. It isn't causing me trouble because I don't use both at the same time.
About Statistics: I think there were about 150,000 "total sentences" for Japanese before the redesign, but now there are only 2,316. Yes, I saw Megan's reply on Sentence collector copyright issues (thanks, Megan). Is this your work, Michael? If so, thank you. So this is an expected change in the numbers, right?
How does the sentence metadata identify the user? Can the "submission ID" identify a user? Will JSON files (e.g., https://kinto.mozvoice.org/v1/buckets/App/collections/Sentences_Meta_ja/records) continue to be useful in the future?
Will the information we set up in the platform's profile be shared in the Collector tool?

Name of what? Native language, does that mean user's? How does it identify the native language?

Personally, I would have liked to keep the voice activity and the sentence collection activity separate (for privacy reasons), but I think it's fine.

Thanks for the update, everybody!

mkohler · October 10, 2020, 9:14pm

We will remove that entry in a bit less than a month.

I’ll need way more info for this to be looked into. Can you record this?

Oh, good catch. I’ve filed issue #335, “Do not overwrite Common Voice Cookie”, in common-voice/sentence-collector on GitHub for this.

That’s correct.

By saving a reference to the user. A submission ID does not directly identify the user, but it’s easy to look up the user for a given submission ID in the backend. And no, the Kinto instance will be deleted once we remove the migration on or after November 7th. All sentences now live in a MySQL database and not in Kinto anymore.
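To make that relationship concrete, here is a minimal sketch that assumes each stored sentence carries a submission ID and a reference to the submitting user. The field names (`submission_id`, `user_id`) and the sample records are hypothetical and are not taken from the actual Sentence Collector schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceRecord:
    # Hypothetical fields for illustration; the real schema may differ.
    text: str
    submission_id: str
    user_id: int


def group_by_submission(records):
    """Group sentence texts by the submission they were part of."""
    groups = defaultdict(list)
    for record in records:
        groups[record.submission_id].append(record.text)
    return dict(groups)


def user_for_submission(records, submission_id):
    """Backend-style lookup: return the user behind a submission ID, or None."""
    for record in records:
        if record.submission_id == submission_id:
            return record.user_id
    return None


records = [
    SentenceRecord("First example sentence.", "sub-a", 1),
    SentenceRecord("Second example sentence.", "sub-a", 1),
    SentenceRecord("Another example sentence.", "sub-b", 2),
]
```

The submission ID on its own is just a grouping key. It only leads to a user for someone who can read the stored records, which is why the lookup is described as a backend operation.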

The name of the language in that specific language, such as “Deutsch” (German).

Can you elaborate on that?

sinumade · October 11, 2020, 12:15pm

Thanks Michael!

2020-10-11-sentence-collector.avi [22 seconds / 121 MB]
- Summary:
  1. Go through the menu in the sidebar, in order.
  2. Open the Review page. [Nothing is displayed]
  3. Reload the page. [Messages are displayed]
  4. Open the Add page.
  5. Open the Review page. [Nothing is displayed]
- It's low quality to keep the video size down.
- The window size has been reduced to keep the video size down.
- There is a mosaic in the email address.
- It's not shown here, but I can review the sentence. I've tried it.

Please tell me if anything is missing.

So, can the user search for other users and sentences?

Yes. In short, I have said publicly that I contribute to Common Voice, and I don't want people to guess which "voice" is mine. I'm also worried that Mozilla (or other people) will tie my voice profile to my Discourse profile. As you can see, I'm obviously active in sentence collection, so to me Discourse and sentence collection feel very closely connected. The integration of the platform and the Collector tool made me feel somewhat uneasy.

But I'm probably worrying too much. I'm paranoid.
Sorry. Please don't be offended; I'm skeptical of any organization, not just Mozilla. I'm a wimp.

Yes, both the Mozilla project and its community are valuable encouragement to me.

mkohler · October 11, 2020, 12:55pm

Thanks for the video, I could reproduce this. Will have a closer look. This seems to happen when there is only one language added to the profile.
Edit: will be fixed with the next deployment

There should IMHO not be any explicit functionality to search for all sentences by a given user without having access to the database. However, while validating this I found a bug that would make this possible, which I’ll fix.
Edit: will be fixed with the next deployment

With something like Firefox Containers you might be able to use different login mechanisms for different Mozilla services. You could, for example, create a separate Firefox Account just for the Sentence Collector and log in with that. If you use different containers, you could nicely separate those without having to log out every time.

stergro · October 13, 2020, 11:42am

Great work, thanks! In particular, being able to see the source of every sentence will reduce the number of errors a lot!

sinumade · October 18, 2020, 9:33am

Yes, what I'm worried about is duplicate sentences and sources, so if the Collector tool automatically detects that, I might not need to be able to search for it.

The reason I'm so fixated on user searches, or on who was involved in collecting the sentences, is that I want users to feel responsible. It's good to see some interesting sentences coming out ... but on the other hand, I'm worried about the use of language that could be hurtful to people. It's really nice to see more people participating, but in terms of self-governance, I have concerns.

Yeah, I'm waiting for the next update.

So containers and profiles are different functions. I didn't know that. Thanks for telling me about it.

@mkohler Did you put an announcements tag on it? Thank you.

I think this topic should be linked from the Collector tool. The Migrate Account page doesn't even have a closing date on it.

As I wrote in Announces "announcements", the announcements are inadequate. Volunteers and visitors are not being notified of any significant changes. The information is still stuck in Discourse.

Tell everyone what we do. I'm sure some people will be interested in the detailed changes.

Maybe, really, when a sentence is removed for any reason, we should announce it in the Collector tool. Volunteers will be in disbelief, especially when the numbers have moved significantly. Eventually, they will ask why in Discourse. And the users who collected the sentences may be displeased.

Think about it. Both the reporting of copyright issues and the decision to remove them are done "behind the scenes". Certainly the movement is public. But if there is no trigger, no one is going to look at it. It's not surprising that many users feel that they are not being taken seriously.

I have shown on my site how many Japanese sentences have been removed and why they were removed. I thought that was the honest thing to do. No matter how small the readership was (even if I was the only volunteer), reasons and changes should be shown.

I know it's hard to deal with 100+ languages. But if we're a team, it has to be shown.

Besides, we need to show that the Tanaka Corpus is a corpus that cannot be used in Common Voice, right?
The Collector tool should also have a complete list of such "disabled corpora".

- Show the changes in the sentence collection and why.
- A list of disabled corpora.
  - Of course it would be nice if the Collector tool could filter it. But it's better to keep a list, so volunteers don't waste their time.

That's it for now.

sinumade · October 31, 2020, 12:42pm

What time zone is the due date based on? Developer's time zone?
It may be a small thing, but when we're dealing with people from all over the world, I think it's good to have that kind of consideration. It's a clear time limit, so to speak.

mkohler · October 31, 2020, 1:41pm

Thanks for the input, that’s a very good point. Let’s say “end of day” of November 7th, no matter your time zone, to make it easy. I’ll also edit my original post to reflect that.
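One way to read “end of day, no matter your time zone” is as the moment the date has ended everywhere on Earth, which is the end of the day at UTC-12. The sketch below computes that instant in UTC. It is only an illustration of this interpretation: the thread does not say how or when the cutoff was actually enforced, and the function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# UTC-12 is the last offset in which a calendar day ends ("Anywhere on Earth").
ANYWHERE_ON_EARTH = timezone(timedelta(hours=-12))


def deadline_utc(year, month, day):
    """Return the UTC instant at which the given date has ended everywhere."""
    start_of_day = datetime(year, month, day, tzinfo=ANYWHERE_ON_EARTH)
    return (start_of_day + timedelta(days=1)).astimezone(timezone.utc)


def is_before_deadline(moment, deadline):
    """True if a timezone-aware datetime falls strictly before the deadline."""
    return moment < deadline


cutoff = deadline_utc(2020, 11, 7)
```

Under this reading, someone in Japan (UTC+9) would still be within the deadline until 21:00 local time on November 8th.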

mkohler · November 9, 2020, 2:49pm

The migration code has now been removed.
