Dialect metadata in the Armenian dataset

manalog · October 8, 2022, 12:42am

Barev,
Starting from problem (2): from what you are saying, it seems that even if someone managed to move the sentences pronounced the “hyw way” from the current hye dataset to the new hyw dataset, it would still be problematic, because the pronunciation would be more or less correct but the sentences themselves would not be (if I understood your points well, it’s night here). I think (but we need someone who has worked with DeepSpeech in this conversation) that some percentage of hyw-pronounced sentences in the hye dataset could even be good for creating a broader model that can recognize a few more situations (e.g. a hyw speaker speaking hye with an accent, a situation that I think can happen easily).
Then the hyw project could be started from scratch, if 5000 sentences are collected and volunteers are found. If you are really an angel, you could mark the sentences that can be moved (same syntax) and the sentences that are better not moved (hye syntax), but I understand it is a lot of work.

On the other hand, annotating users who have hyw pronunciation could be a good thing. To do that you could use the 8th column of the tsv file, “accent”. To modify it you could either use your own script, if you are good with text manipulation, or this software. I found it today here on Discourse; I tried to install it to see whether it can order rows by client_id, but it would not install on my system.
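If you go the script route, something like this rough Python sketch might be enough. It assumes the tsv has a header row with `client_id` and `accent` columns (the accent column may be named differently depending on the release, so check your file first) and that every row has as many fields as the header. The function name `tag_accent` and the label `"hyw"` are just my own placeholders, not anything Common Voice defines. It also sorts the rows by client_id, which is what I wanted to try with the software.

```python
def tag_accent(tsv_text, client_ids, accent_label, column="accent"):
    """Set `column` to `accent_label` for every row whose client_id is in
    `client_ids`, then return the TSV text with rows sorted by client_id.

    Assumption: the first line is a header containing "client_id" and
    `column`, and every row has the same number of fields as the header.
    """
    lines = tsv_text.split("\n")
    header = lines[0].split("\t")
    id_idx = header.index("client_id")   # raises ValueError if missing
    col_idx = header.index(column)       # raises ValueError if missing

    # Skip empty lines (for example the one after the trailing newline).
    rows = [line.split("\t") for line in lines[1:] if line]

    for row in rows:
        if row[id_idx] in client_ids:
            row[col_idx] = accent_label  # hypothetical label, e.g. "hyw"

    # Group each speaker's clips together; the sort is stable, so the
    # original order is kept within one client_id.
    rows.sort(key=lambda r: r[id_idx])

    return "\n".join("\t".join(r) for r in [header] + rows) + "\n"
```

To use it, read the tsv as UTF-8 text, call `tag_accent(text, {"<client_id 1>", "<client_id 2>"}, "hyw")` with the ids of the speakers you identified, and write the result to a new file, so the original stays untouched until someone in charge approves the change.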

In any case, this conversation should be seen by someone who is more familiar with the repositories, because the tsv files would then need to be changed somehow, and that must be approved and done by the moderators. I still haven’t studied how the community works here, but I imagine there are some people in charge of this.

Semi-OT: are the Armenian sentences good? My friend who is doing that said they were a bit strange and had very uncommon words. Maybe she just got some sentences coming from deep-in-topic Wikipedia pages.
