Dialect metadata in the Armenian dataset

The Common Voice dataset has around 2 hrs of Armenian speech. The corpus is useful, but it doesn’t have metadata that specifies the dialect of the speaker. Specifically, there are two core dialects (Western and Eastern), and they differ substantially in their phonology. For example, the letters տ and ռ are pronounced [t, r] in Eastern but [d, ɾ] in Western. The differences are substantial enough that Wiktionary gives a separate IPA transcription per dialect for each lemma. You can see how different they are via WikiPron.

Although it makes sense to pool speakers from both dialects into this corpus, it would also be useful if the dataset specified which speakers are from which dialect. Right now, if I download the dataset from the website, there are 2000 sound files (one file per sentence per speaker), and it gets messy trying to figure out which files belong to speakers of which dialect. Is there a way for someone who’s not the dataset creator (like me) to go into the system and provide metadata per speaker (even if by guessing)?
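As a stopgap, the clips can at least be grouped by speaker so that a dialect label only has to be guessed once per speaker rather than once per file. The sketch below is a rough illustration, not an official tool: it assumes the download includes a tab-separated metadata file (such as `validated.tsv`) with `client_id` and `path` columns, and the function name `group_clips_by_speaker` is made up for this example.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_clips_by_speaker(tsv_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each speaker ID to the clip file names that speaker recorded.

    Assumes a header row containing `client_id` and `path` columns
    (an assumption about the release layout, check your own download).
    """
    # QUOTE_NONE keeps quotation marks inside sentences from confusing the parser.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    clips = defaultdict(list)
    for row in reader:
        clips[row["client_id"]].append(row["path"])
    return dict(clips)


# Tiny made-up sample standing in for the real metadata file.
sample = [
    "client_id\tpath\tsentence\n",
    "speaker_a\tclip_1.mp3\tfirst sentence\n",
    "speaker_b\tclip_2.mp3\tsecond sentence\n",
    "speaker_a\tclip_3.mp3\tthird sentence\n",
]
result = group_clips_by_speaker(sample)
for speaker, files in result.items():
    print(speaker, len(files), files)
```

With a real download, the same function could be fed an open file handle, e.g. `open("validated.tsv", encoding="utf-8", newline="")`. Listening to one or two clips per speaker would then be enough to jot down a tentative Eastern/Western label next to each ID.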

Hello :slight_smile:

Welcome to the Common Voice Discourse, and thanks for your question.

This year we started to include variants for languages, in consultation with language contributors.

We will be starting our second round of consultation on variants before the end of June. Hopefully, with input from you and other Armenian contributors, we can add Western and Eastern Armenian as variants.

You can learn more about this process in the community playbook.