Hey Common Voice Community,
Save the date!
We would like to invite you to attend our Ask Me Anything (AMA) session with EM, Common Voice Lead.
You can put your questions to EM, MCV Lead, via this topic on Monday 6th December, from 3pm to 4pm UTC.
Any questions we are unable to answer live within the hour will be followed up at a later date. Please abide by the Community Participation Guidelines when proposing questions.
We look forward to answering your questions.
This thread is now live. EM’s responses will be included in the toggle beneath each question.
Question 1: Do you have plans to include a link to the Sentence Collector from the main CV page? Sentence collection is half of the process - and it is the harder part. It is just not visible.
EM’s response
We are considering adding it to the main menu - alongside some other enhancements to the sentence collector. One of the priorities we heard from contributors was that they wanted the sentence collector to be localisable - which the amazing mkohler and yourselves have been doing this year. We’re also in the middle of overhauling the About page, including giving much more prominence to the sentence collector. It’s important that we design changes to the menu holistically - but we will consider it in 2022 when our new engineers join.
Question 2: One of the major problems everybody is facing is the CC-0 requirement. Is it not possible for a copyright owner to donate selected/mixed sentences to CV only, without making the work public domain as a whole?
EM’s response
I’m so glad you asked this! We are in fact planning a review of our licensing in 2022 - we obviously can’t make any promises about what the outcome of this will be, as it will go through rigorous research and legal reviews etc. But let me just assure you that we have heard a lot from you in the communities about your needs and your frustrations with the current set-up. Watch this space for chances to support the research next year!
Question 3: Email requirements for registration. Can we have a smoother registration workflow?
EM’s response
This isn’t feedback I’ve heard strongly yet, so thank you so much for raising it! I’d love to hear more about the friction points, and I can follow up with you separately.
Question 4: Currently, the default dataset splits do not take diversity into account. This is especially problematic for low-resource languages. Almost nobody uses the default splits except for benchmarking, as they do not give good results. Do you have any plans to solve this problem?
EM’s response
We are re-evaluating our test, train and dev splits (amongst other things), and we’re about to launch a simple survey for various consumers of the dataset. You can take part here: https://docs.google.com/forms/d/1YZuKAW2399DeQ89IxLcfr6N-vG-bDfMaOMDMVL02VGo/edit
Question 5: Statistical data is scattered across pages, and the existing figures are mostly out of date or only estimates. This kind of information (number of new users, daily contributions, etc.) is very important for community leads, especially during campaigns. Do you have any plans to provide more detailed statistics for these purposes?
EM’s response
Yes! In fact, we were hoping to make an analytics dashboard available to the community this quarter, with accurate representations of some of the top requests we get. However, due to delays in getting more engineering resources on board, this has been pushed to Q1 2022. We expect that some of your top requests - number of clips per language per day, number of dataset downloaders, etc. - will be available on a community dashboard early next year, and that we’ll continue to make enhancements and improvements through the year. Relatedly, we know that some communities are keen for improved team functionality, which is due to be scoped later in 2022.
Question 6: Are there any plans to collect domain-specific sentences/voices and tag them in the dataset, for example technology, arts, law, economics, biology, medicine, etc.?
EM’s response
There are no firm plans to do this, but we are open to all kinds of sentence metadata enhancements. Please feel free to share more context via our simple dataset survey: https://docs.google.com/forms/d/1YZuKAW2399DeQ89IxLcfr6N-vG-bDfMaOMDMVL02VGo/edit