Weekly Update Thread 2023

Hi everyone :tada:

Welcome back to the weekly update from the Common Voice Community team.

Common Voice at Africa AI Conference

Common Voice conducted a session at the Africa AI Conference titled “Africa Mradi - Using Mozilla Common Voice to collect training data for different use cases”. The interactive session took participants through a practical guide, using live examples of use cases built on the Common Voice datasets through the Africa Mradi work in Kiswahili and Kinyarwanda. The session covered:

  • Building AI models for under-resourced languages
  • The Mozilla Common Voice platform
  • How to build training data using the Mozilla Common Voice platform

SoGood 2023 – 8th Workshop on Data Science for Social Good

The possibilities of Data Science for contributing to social, common, or public good are often not sufficiently perceived by the public at large. Data Science applications are already helping in serving people at the bottom of the economic pyramid, aiding people with special needs, helping international cooperation, and dealing with environmental problems, disasters, and climate change. In regular conferences and journals, papers on these topics are often scattered among sessions with names that hide their common nature (such as “Social networks”, “Predictive models” or the catch-all term “Applications”). Additionally, such forums tend to have a strong bias for papers that are novel in the strictly technical sense (new algorithms, new kinds of data analysis, new technologies) rather than novel in terms of social impact of the application.
If you are interested in this workshop, check out the Workshop page. It is affiliated with ECML-PKDD 2023, 18-22 September 2023, Torino, Italy.

Voice Data Collection

We are currently working on a “How to” guide for contributors who wish to gather voice data in their specific languages. Please email us or comment on this update if you have any recommendations, suggestions, or other input you would like us to include in the guide.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @ or @gina_moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


Hi everyone

Welcome back to the weekly update from the Common Voice Community team. :dancer:

Hubs Summer Series: The Future of Immersive Web in Education - Starting off this Thursday at 9am PT

Hubs Summer Series for Educators is a biweekly series that highlights innovative educators using AR/VR/XR as an engaging, effective teaching tool. The series is run in collaboration with XR Bootcamp, a global online academy offering beginner to advanced XR development courses. Curricula are developed and taught by award-winning, leading VR, AR, and MR professionals. XR Bootcamp’s courses are for designers, game developers, managers, and coders who want to learn how to prototype XR applications and learn AR, VR, and MR development. RSVP here. Please extend this invitation to teachers, faculty, and grad students working in this space.

SoGood 2023 – 8th Workshop on Data Science for Social Good

The possibilities of Data Science for contributing to social, common, or public good are often not sufficiently perceived by the public at large. Data Science applications are already helping in serving people at the bottom of the economic pyramid, aiding people with special needs, helping international cooperation, and dealing with environmental problems, disasters, and climate change. In regular conferences and journals, papers on these topics are often scattered among sessions with names that hide their common nature (such as “Social networks”, “Predictive models” or the catch-all term “Applications”). Additionally, such forums tend to have a strong bias for papers that are novel in the strictly technical sense (new algorithms, new kinds of data analysis, new technologies) rather than novel in terms of social impact of the application. If you are interested in this workshop, check out the Workshop site. It is affiliated with ECML-PKDD 2023, 18-22 September, Torino, Italy.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @gina or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.

Enjoy the rest of the week :tada:


It’s your weekly-ish update, sneaking in a bit late this week.

Our big news this week is that there’s a new dataset release! :tada::tada::tada:
I’m especially excited to see Pashto, Albanian, Amharic and Standard Moroccan Amazigh joining the platform and dataset.

Tomorrow is also a special day for the Kiswahili language community: July 7th is UN Swahili Language Day, and our tireless Kiswahili language fellows will be running two field events in Tanzania and DRC.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @gina or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


Welcome back to the weekly update from the Common Voice Community team. :tada:

World Kiswahili Day
World Kiswahili Day was celebrated on Friday, July 7th. As part of the celebration, our fellows hosted two events that ran simultaneously in Tanzania and the Democratic Republic of the Congo from 9am to noon East Africa Time. We had about 50 participants engaging in validation on the Common Voice platform.

Latvian Campaign
The Latvian community organized a campaign that took place from June 30th to July 9th, focusing on recording and validating activities.

Continental Cyber Security Policymaking Virtual Roundtable Discussion
Continental Cyber Security Policymaking: Implications of the Entry into Force of the Malabo Convention for Digital Financial Systems in Africa
The Carnegie Endowment for International Peace’s Technology and International Affairs Program and the Technology, Finance and Commerce Research Group of the School of Law, University of Bradford, United Kingdom invite you to join a conversation on Continental Cyber Security Policymaking and the relevance of the African Union Convention on Cyber Security and Personal Data Protection (Malabo Convention) to Africa’s digital financial ecosystems, particularly the cybersecurity and data protection opportunities and challenges faced by pan-continental operators offering digital financial services.
The virtual roundtable is part of a series within Carnegie’s CyberFI project that surfaces perspectives on cybersecurity, capacity building and digital financial inclusion in Africa. If you are interested in attending this roundtable, register here.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


Welcome back to the weekly update from the Common Voice Community team :tada:

Common Voice Case Studies
We are currently compiling case studies showcasing products, projects, and research that have used Common Voice datasets. If you, or someone you know, have developed or are working on something in academia, health, or any other field and would like to share the project or experience, please reach out to us at commonvoice@mozilla.com.

Discussion on Ethical Data Collection
Fascinating discussion on modern data collection and the accompanying ethical considerations, featuring a Firefox shoutout.

Accelerating the Adoption of Conversational AI in the African Market: Building a Strong Conversational AI Ecosystem in Africa Summit
AI will transform millions of lives in Africa, and Chatbot Africa’s goal is to help form a better ecosystem. Chatbot Africa organizes a series of international conferences in Europe and Africa, bringing together the leading professionals and organizations who design, build, and market conversational AI-based technologies. It is hosting a two-day conference and exhibition for industry executives and adopters of conversational AI, chatbots, virtual assistants, voice technology, and conversation design. The summit will reflect the latest trends and recent application changes in the conversational AI space in African markets. If you would like to attend, get your ticket here.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


Welcome back to the weekly update from the Common Voice Community team :dancer:

Wikimania 2023 Conference
Wikimania is the annual conference celebrating all the free knowledge projects hosted by the Wikimedia Foundation. This year’s conference will run from 16–19 August in Singapore at the Suntec Singapore Convention and Exhibition Centre and online. If you are interested in attending, register here.

We are looking for Common Voice Case Studies
We are currently compiling case studies showcasing products, projects, and research that have used Common Voice datasets. If you, or someone you know, have developed or are working on something in academia, health, or any other field and would like to share the project or experience, please reach out to us at commonvoice@mozilla.com.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.

Welcome back to the weekly update from the Common Voice Community team :dancer:

Common Voice at ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies, Cape Town

Common Voice was invited to the ACM SIGCAS/SIGCHI Conference. The goal of the conference is to dissect the complexities surrounding African language research, focusing on NLP’s role in sustainability. We aim to address topics like skill development, data standardization, policy frameworks, ethics, and unique approaches for the evolution of African languages’ NLP. This comprehensive approach aims to empower the participants to overcome challenges and contribute to the dynamic evolution of NLP for African languages. Leading the Common Voice project for South African languages, Professor Febe presented the current status of SA languages on our platform. We are excited to share that all 10 South African languages have been officially launched and are now open for voice contributions.

Wikimania 2023 started today and will run until 19 August 2023 in Singapore

Wikimania is the annual conference celebrating all the free knowledge projects hosted by the Wikimedia Foundation. This year’s conference runs from 16–19 August in Singapore at the Suntec Singapore Convention and Exhibition Centre and online. If you are interested, watch the live stream here.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.

Welcome back to another weekly update from the Common Voice Community team :dancer:

Common Voice Continues to Grow
Valencian writer Carles Cortés was very generous to share sentences from his book “Marta dibuixa ponts” with Common Voice.

Collective Data Rights and their Possible Abuse
For anyone interested, here’s a brief and captivating piece on the potential hazards of collective data rights.

Reclaiming the Digital Commons: A Public Data Trust for Training Data
Another interesting article, this one on the democratization of AI. It highlights the dominance of large corporations in collecting and controlling data, which raises concerns about privacy and bias. It emphasizes the need for collaborative efforts among governments, civil society, and industry to create a balanced and accessible digital commons.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


Welcome back to the weekly update from the Common Voice Community team :dancer:

MozFest House: Kenya, taking place 21-22 September, 2023
The Common Voice team will be attending the upcoming MozFest House in Nairobi at the Shamba House Cafe. The attendees will tackle pressing realities at the intersection of emerging technology and the African continent, such as digital extractivism and AI governance. If you will be around Nairobi at that time, come and join us. Tickets are free; register here.

On AI News :computer:
The Athens Roundtable on AI and the Rule of Law
The Fifth Edition of The Athens Roundtable on AI and the Rule of Law will delve into pressing governance challenges posed across jurisdictions by foundation models and generative AI. The conference will be held from November 30th to December 1st, in person and via Zoom. If interested, join AI experts in thought-provoking discussions spanning emerging regulations, measurement and standards, enforcement, and international coordination. Register to attend online here.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.

Welcome back to the weekly update from the Common Voice Community team :dancer:

New Dataset Release
The CV team is preparing a new dataset release. The cutoff date will be this Friday the 8th of September 2023 and the release will be out by the end of the upcoming week. Thank you to all the contributors and supporters who play a vital role in the growth and success of this fantastic project!

Common Voice appeared in INTERSPEECH 2023
Work by one of our community members appeared at INTERSPEECH 2023. Read more about it here.

MozFest House: Kenya, taking place 21-22 September, 2023
The Common Voice team will be attending the upcoming MozFest House in Nairobi at the Shamba House Cafe. The attendees will tackle pressing realities at the intersection of emerging technology and the African continent, such as digital extractivism and AI governance.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or myself @gina or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


New dataset release vibes :tada: :tada: :tada:


Hello Everyone :smiley:

Welcome back to the weekly update from the Common Voice Community team :dancer:

Dataset Release :dancing_men:
The dataset release is out! In this latest release, we’ve added two new languages, Hebrew and Afrikaans, and the dataset has grown by a significant 633 hours compared to the previous version. This brings the total number of languages in the latest release to 114, totaling an impressive 28,750 hours, of which 19,159 hours have been validated.
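For readers who want to put the release figures in proportion, here is a minimal sketch of the arithmetic. It assumes the 633-hour increase refers to total recorded hours (the announcement does not say whether it means total or validated hours), and the function name is purely illustrative.

```python
# Figures quoted in the release announcement.
TOTAL_HOURS = 28_750
VALIDATED_HOURS = 19_159
ADDED_HOURS = 633  # assumption: this refers to total recorded hours


def validated_share(validated: float, total: float) -> float:
    """Return the fraction of recorded hours that has been validated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return validated / total


share = validated_share(VALIDATED_HOURS, TOTAL_HOURS)
# Hours that are not validated: still awaiting votes, or rejected.
not_validated_hours = TOTAL_HOURS - VALIDATED_HOURS
# Total size of the previous release implied by the stated increase.
previous_total_hours = TOTAL_HOURS - ADDED_HOURS

print(f"Validated share: {share:.1%}")
print(f"Hours not validated: {not_validated_hours:,}")
print(f"Implied previous total: {previous_total_hours:,} hours")
```

On these numbers roughly two thirds of the recorded hours are validated.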

GCP migration
The Common Voice team is passionate about offering our contributors and our dataset users the most stable long-term experience possible. As part of that commitment, we’re moving our hosting from Amazon Web Services (AWS) to Google Cloud Platform (GCP) in the coming weeks. This move is part of a wider effort by the Mozilla Foundation to move projects onto a single hosting provider to streamline costs and service needs. The migration will start at 08:00 UTC on Monday, September 18th, 2023. Because we’ll be switching cloud hosting providers, the Common Voice platform will be temporarily unavailable for up to 120 minutes from that time.
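If you are planning contribution sessions around the downtime, the window can be converted to local time with a few lines of Python. This is only a sketch: the start time and the 120-minute maximum come from the announcement, while the East Africa Time zone (modelled here as a fixed UTC+3 offset) is just an example choice.

```python
from datetime import datetime, timedelta, timezone

# Start time and maximum duration as stated in the announcement.
WINDOW_START = datetime(2023, 9, 18, 8, 0, tzinfo=timezone.utc)
MAX_DOWNTIME = timedelta(minutes=120)
WINDOW_END = WINDOW_START + MAX_DOWNTIME

# Example zone only: East Africa Time as a fixed UTC+3 offset (no DST).
EAT = timezone(timedelta(hours=3), "EAT")


def local_window(tz: timezone) -> tuple[datetime, datetime]:
    """Return the start and latest end of the downtime in the given zone."""
    return WINDOW_START.astimezone(tz), WINDOW_END.astimezone(tz)


start_local, end_local = local_window(EAT)
print(
    f"Downtime: {start_local:%Y-%m-%d %H:%M} to {end_local:%H:%M} "
    f"{start_local.tzname()} at the latest"
)
```

Swap `EAT` for any other `timezone` object to get the window for your own community.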

Common Voice in Team Community’s Global Gathering in Portugal
Common Voice fellow Rebecca Ryakitimbo will be showcasing Common Voice at the Global Gathering Feira in Portugal from 15-17 September. The summit brings together managers and community leaders from the digital rights space to map out both external and internal emerging threats.

Common Voice at the 2023 Deep Learning Indaba
Common Voice fellow Kathleen Siminyu represented CV at the 2023 Deep Learning Indaba, which was held in Ghana.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @Jess or myself @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


The metadata coverage visualisations for the v15 release are now available, too.

Some interesting insights:

  • The Bangla / Bengali (bn) language corpus appears to be growing considerably
  • Similar for Belarusian (be)
  • Similar for Kiswahili (sw)
  • Similar for Mandarin Chinese (zh-CN)

With the rapidly growing language corpora, validation appears to remain a challenge, with a lot of utterances remaining unvalidated.

  • African languages such as Kiswahili (sw), Kinyarwanda (rw), Luganda (lg), Kabyle (kab) have some of the highest utterance to speaker ratios - that is, they have a small number of speakers who have donated a high volume of utterances. Meadow Mari (mhr) also fits into this category. I suspect that many languages with small language communities will face similar challenges. However, this doesn’t appear to be constraining the size of the train sets too much - which indicates sentence diversity within the constrained number of speakers. (side note: I need better data on this)
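As a rough illustration of the utterance-to-speaker ratio mentioned above, the sketch below counts rows per speaker in a Common Voice style TSV. It assumes each row is one utterance and that speakers are identified by a `client_id` column; the sample rows and the function name are made up for the example.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample in the assumed TSV layout (one row per utterance).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "spk1\ta.mp3\tHabari ya asubuhi\n"
    "spk1\tb.mp3\tHabari ya jioni\n"
    "spk1\tc.mp3\tAsante sana\n"
    "spk1\td.mp3\tKaribu tena\n"
    "spk2\te.mp3\tAsante sana\n"
)


def utterances_per_speaker(tsv_file) -> float:
    """Return the mean number of utterances per distinct speaker."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row["client_id"] for row in reader)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


ratio = utterances_per_speaker(io.StringIO(SAMPLE_TSV))
print(f"Utterances per speaker: {ratio:.1f}")
```

A high value means a few speakers contribute most of the recordings. For a real release, pass an open file handle for the language's TSV instead of the sample.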

Great points @kathyreid. Just to add some more on your valuable insights:

Validation keeps falling behind globally, as you can see in the graph above. Some languages keep up, but others are left as they are. You can see validation jump and catch up around v7.0 and v8.0, but afterward it grows linearly while recording grows more steeply, so the gap keeps adding up.

I see two reasons for that:

  1. Some communities form around a specific project, and after reaching a milestone (e.g. 100h validated) they seem to leave the dataset/project. The Uzbek language seems to be one of them.
  2. At the end of 2021, towards v8.0 Common Voice had a global social media campaign, which resulted in a nice jump, also emphasizing the importance of the validation process. During that period the project team had stronger relations with language communities, helping them. This is unfortunately not the case after 2022.

I think a similar global campaign would provide a solution for the validation backlog and also for the whole project. Building more knowledgeable core groups that do the validation continuously will solve the problem.

Note that “Validated%” will never reach 100% if there are invalidated recordings (which usually range between 2-10% depending on the language).
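To make the ceiling concrete, here is a small sketch with made-up clip counts. It assumes “Validated%” is computed as validated clips divided by all recorded clips, so every invalidated clip lowers the maximum that can be reached.

```python
def validated_ceiling(total_clips: int, invalidated_clips: int) -> float:
    """Return the highest reachable validated fraction for a language."""
    if total_clips <= 0:
        raise ValueError("total_clips must be positive")
    if not 0 <= invalidated_clips <= total_clips:
        raise ValueError("invalidated_clips must be between 0 and total_clips")
    return (total_clips - invalidated_clips) / total_clips


# Hypothetical language: 1,000 recorded clips, 50 of them invalidated (5%).
ceiling = validated_ceiling(1_000, 50)
print(f"Validated% can reach at most {ceiling:.0%}")
```

With 2-10% of recordings invalidated, the ceiling sits somewhere between 98% and 90%.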

This is usual for new languages, but lack of diversity is definitely a problem with older ones with larger datasets.

Again, working closer with these communities and performing global events will help a lot.

Such analysis of the complete text corpora became impossible after the sentences went into the database. Without a periodic export of these sentences, we cannot analyze the sentence diversity, vocabulary coverage, how much of the text corpus has been covered, etc. See this issue.

I have them analyzed up to March 2023, and I also have simple, up-to-date counts of distinct sentences for each split, in the Dataset Analyzer’s Text-Corpus and Sentences tabs respectively. Perhaps, at least temporarily, I can add a text-corpora analysis based on validated.tsv, which might at least show the current status of the validated set.
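A minimal sketch of what a validated.tsv-based analysis might compute is below. It is not the Dataset Analyzer's code: it assumes a `sentence` column, uses naive whitespace tokenization (which will not suit every language), and runs on made-up sample rows.

```python
import csv
import io

# Made-up sample in the assumed validated.tsv layout.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "spk1\ta.mp3\thabari ya asubuhi\n"
    "spk2\tb.mp3\thabari ya asubuhi\n"
    "spk1\tc.mp3\thabari ya jioni\n"
    "spk3\td.mp3\tasante sana\n"
)


def corpus_stats(tsv_file) -> dict[str, int]:
    """Count recordings, distinct sentences, and distinct lowercase tokens."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences = [row["sentence"] for row in reader]
    distinct = set(sentences)
    # Naive tokenization: split on whitespace and lowercase each token.
    vocabulary = {word.lower() for s in distinct for word in s.split()}
    return {
        "recordings": len(sentences),
        "distinct_sentences": len(distinct),
        "vocabulary": len(vocabulary),
    }


stats = corpus_stats(io.StringIO(SAMPLE_TSV))
print(stats)
```

Comparing `recordings` with `distinct_sentences` gives a first hint of how often sentences are repeated across speakers.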

Any idea on this would be very valuable @kathyreid


Welcome back to the weekly update from the Common Voice Community team :dancer:

MozFest House: Kenya Sessions
The Common Voice team attended the MozFest House in Nairobi at the Shamba House Cafe. The sessions tackled pressing realities at the intersection of emerging technology and the African continent, such as digital extractivism and AI governance. If you missed the sessions and would like to listen to any of them, check the recordings here.

On AI News
AI chatbots were tasked to run a tech company. They built software in under 7 minutes — for less than $1.
This paper presents an innovative paradigm that leverages large language models (LLMs) for the software development process, streamlining and unifying key processes through natural language communication. The approach eliminates the need for specialized models at each phase. At this rate, I wonder what the AI landscape will look like in 5 to 10 years. What do you think of this innovation? Share your thoughts and opinions in the comments section, let’s have a conversation :smiley:

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @Jess or myself @gina or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.


I’ll be covering this topic in a lecture I’m giving at a conference next month. This paper/example is already in my presentation. I’ll let you know more about that later.


That’s great. Will the conference be hybrid or online, so that interested people can attend?

Welcome back to the weekly update from the Common Voice Community team :dancer:

Common Voice Website
We highlighted the “Donate” button for the members and partners who would like to donate to CV.

Common Voice at the Forum on Internet Freedom in Africa
CV was at the Forum on Internet Freedom in Africa in Tanzania. This edition marked a decade of the largest gathering on internet freedom in Africa, which since 2014 has put internet freedom on the agenda of key actors including African policy makers, platform operators, telcos, regulators, human rights defenders, academia, law enforcement representatives, and the media. Read more about the forum here.

Common Voice Fellow trailblazing the NLP and AI Space
Kathleen Siminyu, a Kiswahili machine learning fellow, has been nominated for the Future 50, A New Generation of Leaders. Kathleen was nominated for democratizing machine learning and bringing speech and language technology to underserved communities. Read more and connect with Kathleen here.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.

Hey Everyone :dancer:
Welcome back to the weekly update from the Common Voice Community team.
Common Voice Use Case Studies:
We are currently compiling case studies showcasing products, projects, and research that have used Common Voice datasets. If you, or someone you know, have developed or are working on something in academia, health, or any other field and would like to share the project or experience, please reach out to us at commonvoice@mozilla.com.

Call for Administrative Diversity Supplement Proposals
Blossom is looking to expand and diversify their team and they are seeking graduate students, postdoctoral fellows, and/or early career professionals to submit Administrative Diversity Supplement proposals. Blossom is an NIH-funded language learning platform that empowers parents and educators to nurture multilingual, bright, and resilient children. They are looking for Indigenous researchers with experience in education, linguistics, or computer science and who are interested in developing digital storybook tools and an online library of children’s books with traditional food themes, authored and illustrated by members of the community. You can read more about the project here. Transcendent Endeavors will pay each chosen candidate as a Senior Research Fellow to develop their own application, with their own self-directed research questions. If funded, the fellowship would begin in Summer 2024 and continue for 12 months with the possibility of full-time employment following the fellowship. For more information on this opportunity, contact Tori-Ann Williams, MS at tawilliams@t19s.com.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll joyously include you in the next update.

Hello Community :smiley:
Welcome back to the weekly update from the Common Voice community team.

Common Voice Use Case Studies: We are currently compiling case studies showcasing products, projects, and research that have used Common Voice datasets. If you, or someone you know, have developed or are working on something in academia, health, or any other field and would like to share the project or experience, please reach out to us at commonvoice@mozilla.com.

Common Voice Writeathon Winners announcement: CV invites you to today’s virtual winners announcement event for the Wanawake Mashujaa writing competition. The top authors documenting women changemakers in DRC, Kenya and Tanzania will be revealed! If you are free and keen, join us in celebrating their work in Kiswahili and meet some of the women profiled. Register here.

Did this update miss something important? Are you doing something cool that we can help you show off? Reply to this thread, message @jesslynnrose or @Gina_Moape or say hello on Matrix or email commonvoice@mozilla.com and we’ll include you in the next update.