Creating Open and Accessible Resources for Uzbek: The Project

Language technology is advancing rapidly, and the need for diverse and accessible datasets is growing with it. However, data resources remain scarce, especially for languages with smaller speaker populations. Uzbek is one such language, spoken by approximately 33 million people. The team recognized the importance of developing an open-source text and voice dataset for the Uzbek language in an effort to create a more equitable and inclusive future for voice tech. To date, they have collected about 1,400 hours of high-quality audio with accompanying texts, all of which are publicly available and hosted on Google Drive for easy access.

Creating such a dataset requires a significant investment of time, resources, and expertise. The team has put a great deal of hard work and dedication into creating this valuable resource for their community. Access to this dataset will give the NLP community, researchers, and developers the data they need to build speech recognition and natural language processing applications for Uzbek. The team's work is a testament to the power of collaboration and community. By creating an open, accessible resource, the team is helping to build a more equitable and inclusive future for all. When we work together and share our resources, we can overcome the barriers of geography, language, and culture to build a better world.

Common Voice recognizes and commends the outstanding efforts of the team in collecting and sharing a large dataset of text and audio recordings for the Uzbek language.

To connect with the team, contact Mukhammad Amin Kodirov on