Ask Me Anything (AMA) session with Common Voice Lead

heyhillary · September 23, 2024, 12:22pm

Hey Common Voice Community,

Save the date!

We would like to invite you to attend our Ask Me Anything (AMA) session with EM, Common Voice Lead.

You can put your questions to EM, MCV Lead, in this topic on Monday 6th December, 3pm-4pm UTC.

Any questions we are unable to answer live within the hour will be followed up at a later date. Please abide by the Community Participation Guidelines when proposing questions.

We look forward to answering your questions.

This thread is now live. EM’s responses are included beneath each question below.

Question 1: Do you have plans to include a link to the Sentence Collector from the main CV page? Sentence collection is half of the process - and it is the harder part. It is just not visible.

EM’s response

We are considering adding it to the main menu - alongside some other enhancements to the sentence collector. One of the priorities we heard from contributors was that they wanted the sentence collector to be localisable - which the amazing mkohler and yourselves have been doing this year. We’re also in the middle of overhauling the About page, including giving much more prominence to the sentence collector. It’s important that we design changes to the menu holistically - but we will consider it in 2022 when our new engineers join.

Question 2: One of the major problems everybody is facing is the CC-0 requirement. Is it not possible for a copyright owner to donate selected/mixed sentences to CV only, without making the work public domain as a whole?

EM’s response

I’m so glad you asked this! We are in fact planning a review of our licensing in 2022 - we obviously can’t make any promises about what the outcome of this will be, as it will go through rigorous research and legal reviews etc. But let me just assure you that we have heard a lot from you in the communities about your needs and your frustrations with the current set-up. Watch this space for chances to support the research next year!

Question 3: Email requirements for registration. Can we have a smoother registration workflow?

EM’s response

This isn’t feedback I’ve heard strongly yet, so thank you so much for raising it! I’d love to hear more about the friction points, and I can follow up with you separately.

Question 4: Currently, the default dataset splits do not take diversity into account. This is especially problematic for low-resource languages. Nearly nobody uses the default splits, except for benchmarking as they do not provide good results. Do you have any plans to solve this problem?

EM’s response

We are re-evaluating our train, dev and test splits (amongst other things) and we’re about to launch a simple survey for various consumers of the dataset. You can take part here: https://docs.google.com/forms/d/1YZuKAW2399DeQ89IxLcfr6N-vG-bDfMaOMDMVL02VGo/edit

Question 5: Statistical data is distributed among pages; existing ones are mostly not current or just estimates. This kind of information (number of new users, daily contributions, etc) is very important for community leads, especially during campaigns. Do you have any plans to provide more detailed statistics for these purposes?

EM’s response

Yes! In fact, we were hoping to make an analytics dashboard available to the community this quarter, with accurate representations of some of the top requests we get. However, due to delays getting more engineering resources on board, this has been pushed to Q1 2022. We expect that some of your top requests - number of clips per language per day, number of dataset downloaders etc - will be available on a community dashboard early next year, and that we’ll continue to make enhancements and improvements through the year. Relatedly, we know that some communities are keen for improved team functionality, which is due to be scoped later in 2022.

Question 6: Are there any plans to collect domain-specific sentences/voices and tag them in the dataset, such as technology, arts, law, economics, biology, medicine etc?

EM’s response

There are no firm plans to do this, but we are open to all kinds of sentence metadata enhancements. Please feel free to share more context via our simple dataset survey: https://docs.google.com/forms/d/1YZuKAW2399DeQ89IxLcfr6N-vG-bDfMaOMDMVL02VGo/edit

heyhillary · November 25, 2021, 2:58pm

heyhillary · December 6, 2021, 12:08pm

bozden · December 6, 2021, 2:12pm

Thank you for this opportunity… I have several questions, but I think many more people are also trying to find answers to these… These are the topics that have been constantly coming up in Discourse and Matrix discussions or meetings.

Do you have plans to include a link to the Sentence Collector from the main CV page? Sentence collection is half of the process - and it is the harder part. It is just not visible.
One of the major problems everybody is facing is the CC-0 requirement. Is it not possible for a copyright owner to donate selected/mixed sentences to CV only, without making the work public domain as a whole?
Are there any plans to collect domain-specific sentences/voices and tag them in the dataset, such as technology, arts, law, economics, biology, medicine etc?
One can only reach large crowds through social media campaigns and most users are using social media through mobile apps. But the WebView style browsers prevent CV’s main functionality, recording, which is mostly a deal-breaker. We faced this many times in our campaigns - we can only say “copy-paste the link”. Do you have any solution for this?
Do you have plans to provide a secure e-mail service for language communities? Volunteers/moderators who are shaping the community have no means of contacting other donors.
Currently, the default dataset splits do not take diversity into account. This is especially problematic for low-resource languages. Nearly nobody uses the default splits, except for benchmarking as they do not provide good results. Do you have any plans to solve this problem?
CV does not set goals or impose limits for individuals. This usually results in some people making too many recordings, trying to get to the top of the contribution lists, etc, which is not good for diversity and many hours of these recordings cannot be used in final splits. Do you have plans to remedy this?
Statistical data is distributed among pages; existing ones are mostly not current or just estimates. This kind of information (number of new users, daily contributions, etc) is very important for community leads, especially during campaigns. Do you have any plans to provide more detailed statistics for these purposes?

Thank you for your time…

Em.Lewis-Jong · September 23, 2024, 12:21pm

Hey everyone! I’m E-M, the Product Lead for Common Voice at Mozilla Foundation. I have met some of you already and am excited to answer your Qs today!

ftyers · December 6, 2021, 3:13pm

I’d add to that: is there any plan to let communities decide how their data should be licensed, or at least to allow other CC licences? (Allowing CC-BY-SA would solve many problems [although not all]).

daniel.abzakh · December 6, 2021, 3:28pm

Hello,

Going through the Common Voice pilot campaign for the Abkhazian team, one of the main issues we encountered was the email requirement for registration.
This created a bottleneck in the process for the following reasons:

Some people don’t have a usable email, so we had to create ghost emails for each one of them.
Registration through email requires jumping back and forth between different windows, which is confusing for many people.
Sometimes registration through email fails for unknown technical reasons.

Questions:

Can we allow registration without an email?
Can we have a smoother registration workflow?
(e.g. on smartphones, having the registration button on the main page; once clicked, a modal appears to fill in the username, password (once), age, sex, accent.)

Em.Lewis-Jong · December 7, 2021, 6:23pm

Thanks so much everyone for your questions! Sorry I couldn’t get to everything, but this won’t be the last chance for us to speak. Emails to commonvoice@mozilla.com will find me. Excited for all the improvements coming to you in 2022 - EM

bozden · December 6, 2021, 4:16pm

Thank you again

daniel.abzakh · December 7, 2021, 6:07pm

This doesn’t seem to be the right address.

Em.Lewis-Jong · December 7, 2021, 6:23pm

oh sorry! my fault - we actually haven’t transitioned it yet - this should be commonvoice@mozilla.com - have edited above as well

robovoice · January 2, 2022, 6:31pm

Plot twist: You have got email.

Em.Lewis-Jong · September 23, 2024, 12:19pm
