Dialect metadata in the Armenian dataset

The Common Voice dataset has around 2 hours of Armenian speech. The corpus is useful, but it doesn’t have metadata that specifies the dialect of the speaker. Specifically, there are two core dialects (Western and Eastern), and they have substantial differences in their phonology. For example, anything spelled with տ,ռ is pronounced [t,r] in Eastern, while those letters are pronounced [d,ɾ] in Western. The differences are substantial enough that Wiktionary gives each lemma a separate IPA transcription per dialect. You can see how different they are via WikiPron.

Although it makes sense to want to pool speakers from both dialects into this corpus, it would also be useful if the dataset specified which speakers are from which dialect. Right now, if I download the dataset from the website, there are 2,000 sound files (one file per sentence per speaker), and it can get messy trying to figure out which set of files belongs to which dialect’s speakers. Is there a way for someone who’s not the dataset creator (like me) to go into the system and provide metadata per speaker (even if by guessing)?


Hello :slight_smile:

Welcome to the Common Voice Discourse, and thanks for your question.

This year we started to include variants for languages, in consultation with language contributors.

We will be starting our second round of consultation on variants before the end of June. Hopefully, with your input and that of other Armenian contributors, we could have these added as variants.

You can learn more about this process in the community playbook.


Hello

I saw the updates to the community playbook and the associated GitHub repo. Does that mean the second round has already started? If not, is there some sort of mailing list I can join so that I can ultimately help with the variant metadata for the Armenian collection?

Hello, I’m also following this topic.
I have the opportunity to spread Common Voice in an Armenian community, but I am waiting for this issue to be fixed. From what I know about Armenian (even though I don’t speak the language), I think the most correct thing to do is to separate Eastern and Western Armenian as two different languages. I see that this is already happening (Western Armenian is listed as “in progress” on the language page).
If the question is settled, wouldn’t it be better to rename Armenian to “Eastern Armenian” or “Armenian (Eastern)” to avoid possible confusion?
On the other hand, @Hossep_Dolatian, could you please do a quick check of the dataset to confirm that it is mostly Eastern? Considering that it’s only 5 hours now, I think a small percentage of Western would not be problematic (it might even be a good thing)… depending on your time, even a 10-minute check can be very useful; I can ask another person to do the same.
In general, I think building an Armenian dataset is extremely important; I checked the number of hours generally available for training ASR systems and it’s very small: even OpenAI’s Whisper was trained on only 13 hours of Armenian.
I’ve found this paper about Armenian datasets for ASR; I still have to read it, but it seems interesting.

Hi @manalog
Language/difference: Technically, Western and Eastern are the same language politically and genetically, but they have quite different phonology/morphology/syntax, which is why they’re treated as separate dialects. You’re right, though, that the basic division should be Armenian (Eastern) vs Armenian (Western), with ISO codes hye and hyw; that’s how Wiktionary handles them (see the pronunciation entries for the word գրել).

VC content: So I was one of the participants in submitting recordings. Based on past conversations, it seems that the VC content is mostly Eastern but with a sizable Western section. The problems though are the following:

  1. I can of course listen to a sample from each client_id to determine whether the speaker has a Western or Eastern accent, but then it’s not obvious to me how I can place that information back into the VC metadata.
  2. The Western recordings are somewhat methodologically problematic. The participants would read a sentence written in Eastern Armenian and say it with a Western Armenian accent. The problem is that the two dialects have different orthographic and syntactic/morphological conventions, so a word that is highly frequent in Eastern Armenian may not exist in Western Armenian, and vice versa. For example, the word for “he said” is [ɑsɑts] ասաց in hye but [əsav] ըսաւ in hyw. So whereas the hye data looks clean for training an hye ASR model, it’s difficult to evaluate how clean the hyw data is for training an hyw variant of the hye ASR model.

Problem (2) is more of a question about how to run future data-collection sessions, while problem (1) is more a question of “yes, I’d like to help make new metadata, but how can I update the existing metadata?”
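For the listening part of problem (1), a small script can at least shrink the task from 2,000 files to roughly one file per speaker. Here is a minimal sketch in Python, assuming the standard Common Voice TSV layout with `client_id` and `path` columns (the function name and file paths are my own illustration, not part of the official tooling):

```python
import csv
from collections import OrderedDict

def one_clip_per_speaker(tsv_path):
    """Return {client_id: first clip filename}, so a reviewer can
    listen to one recording per speaker and note the dialect."""
    speakers = OrderedDict()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # keep only the first clip we encounter for each speaker
            speakers.setdefault(row["client_id"], row["path"])
    return speakers

# Example: print a short listening checklist
# for cid, clip in one_clip_per_speaker("hy-AM/validated.tsv").items():
#     print(cid[:8], clip)
```

With ~60 voices in the dataset, that checklist is short enough to work through by ear in one sitting.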

Barev,
Starting from problem (2): from what you are saying, it seems that even if one managed to move the sentences pronounced the “hyw way” from the current hye dataset to a new hyw dataset, it would still be problematic, because even if the pronunciation were roughly correct, the sentences themselves would not be (if I understood your points well; it’s night here :wink: ). I think (but we need someone who has worked with DeepSpeech in this conversation) that some percentage of hyw-pronounced sentences in the hye dataset could even be good for creating a broader model that can recognize more situations (e.g., a hyw speaker speaking hye with an accent, which I think can easily happen).
Then the hyw project could be started from scratch, once 5,000 sentences are collected and volunteers found. If you are really an angel, you could mark the sentences that can be moved (same syntax) and the sentences that are better left alone (hye syntax), but I understand that it’s a lot of work.

On the other hand, annotating users with hyw pronunciation would be a good thing; to do that, you could use the 8th column of the TSV file, “accent”. To modify it, you could either write your own script, if you are good with text manipulation, or use this software. I found it today here on the Discourse; I tried to install it to see whether it’s possible to sort rows by client_id, but it wouldn’t install on my system.
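A script of that kind could look like the following Python sketch, assuming the Common Voice TSV layout and an `accent` column (newer releases name the column `accents`; adjust to whichever your release uses). The function name and the dialect labels are just an illustration:

```python
import csv

def tag_accents(tsv_in, tsv_out, speaker_dialects):
    """Copy a Common Voice TSV, filling the accent column for speakers
    whose dialect has been identified by ear.
    speaker_dialects: {client_id: "eastern" or "western"}."""
    with open(tsv_in, newline="", encoding="utf-8") as fin, \
         open(tsv_out, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            # only touch rows for speakers we have actually checked
            if row["client_id"] in speaker_dialects:
                row["accent"] = speaker_dialects[row["client_id"]]
            writer.writerow(row)
```

This only produces a locally modified copy, of course; getting the change into the published dataset is the moderation question discussed below.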

In any case, this conversation should be seen by someone closer to the repositories, because the TSV files would then have to be changed somehow, and that must be approved and done by moderation. I still haven’t studied how the community works here, but I imagine some people are in charge of this.

Semi-OT: are the Armenian sentences good? A friend of mine who contributed said they were a bit strange, with very uncommon words. Maybe she just got sentences coming from deep-in-topic Wikipedia pages.

Yes

Potentially. Currently there’s a team in Paris working on a cross-dialectal ASR system, and I’m helping them gather audio corpora. So solving the Common Voice problems would have benefits beyond this dataset.

Potentially. There is a research group involved in Armenian ASR, though I don’t think they release open datasets, for national-security reasons (the Azerbaijani invasions). But potentially, if/when a hyw expansion is launched, I can reach out via their volunteer networks.

TBD if I will be such an angel :blush:

I would’ve just played with Excel.

It’s deep Wikipedia, I think: a lot of high-register, fancy words.

Annotating these sentences could also be the right way to go for the researchers working on a cross-dialectal system. If clips are flagged, it’s easy to move them wherever they are most useful: for those researchers in particular, but in general for everyone who will work with the dataset.
Now we need a reaction from someone who can decide whether manual modification of the TSV files is allowed.

Three quick responses:


For the second point, yes, it would make sense to merge the two so that they can share ASR resources, assuming that variants are defined so that the pronunciations of the two dialects remain distinct (as they are on Wiktionary).


I have read Elizabeth’s post, and it makes sense, because a good ASR system can indeed be trained on both corpora. Neural networks are such fascinating systems that they can usually figure out how to solve these kinds of issues by themselves, but I nonetheless think it’s important to have some mandatory classification implemented on Common Voice, for these reasons:

  1. Even if someone later concludes that a specific Armenian model is better built from a single variant, it will be easy to split the dataset; for example, someone wanting to create a hyw-optimized system, or future research finding that separate training works better, etc.
  2. Someone may want to use the dataset for other studies.
  3. It enables a TTS system that can be set to match the desired pronunciation.
  4. It enables automatic recognition of which variant is being used.
  5. In general, when building a dataset it’s always fundamental to gather data, and this is very important data, because we are not facing “just” a dialect (as with Italian from different regions) but phonological, syntactic, and morphological differences. AFAIK the situation is a bit different from other languages, so much so that even Wikipedia differentiates hye from hyw.

So it would be good to fix the current TSV file. There are currently only around 60 voices in the dataset, so it’s not too late to determine whether each is Eastern or Western and then update the TSV file directly on Mozilla’s servers, so as not to create confusion. I know it’s a hard decision that has to be discussed thoroughly, but it’s important at least to open the discussion.
To avoid this confusion in the future, my proposal is that the system should ask each Armenian user whether they are going to speak with Eastern or Western pronunciation, and this field should be mandatory, so that all registered users are annotated (in theory it could be done even for non-registered users by prompting just before the recordings are sent, but I know that would need a slight modification to the code).

At least in this way a great part of the issue can be solved, and even if none of the points I made comes to pass (which can and probably will happen), at least we bring the chance that the dataset becomes useless closer to zero. And this is super important, not just to avoid wasting the work of CV staff and volunteers, but also because CV is currently the only public-domain Armenian dataset, so a failure of Armenian Common Voice would mean a failure of Armenian open-source ASR and TTS for many years, and a loss for an entire community.

The issue of differences in sentences still remains open. Maybe it’s a minor issue compared with pronunciation (or maybe not; we need an answer from Armenians). This would be harder to solve with tweaks to the code, and probably no one will actually make those modifications on GitHub because it’s harder work. But in that case, wouldn’t it be better to separate the languages, put them close together in the list, and then merge them when the dataset is created? In this way, Elizabeth’s very valid observation (that hyw could be overlooked, or that developers might abandon the project in the confusion) would be addressed, because the downloaded dataset would contain just one set of TSVs and one folder of MP3s, with the merging done server-side. And it’s super easy to code: just a simple script based on cat and cp :smiley:
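To make the server-side merge idea concrete, here is a rough Python sketch. It assumes each variant release has a `validated.tsv` and a `clips/` folder, and it adds a hypothetical `variant` column so each row’s origin is preserved; none of this reflects Mozilla’s actual release tooling:

```python
import csv
import shutil
from pathlib import Path

def merge_datasets(hye_dir, hyw_dir, out_dir):
    """Merge two Common Voice-style releases into one combined TSV
    and one clips folder, tagging each row with its variant."""
    out = Path(out_dir)
    (out / "clips").mkdir(parents=True, exist_ok=True)
    rows, fieldnames = [], None
    for variant, src in (("hye", Path(hye_dir)), ("hyw", Path(hyw_dir))):
        with open(src / "validated.tsv", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            # extend the header once, with the added provenance column
            fieldnames = fieldnames or reader.fieldnames + ["variant"]
            for row in reader:
                row["variant"] = variant
                rows.append(row)
                # copy the audio into the single merged clips folder
                shutil.copy(src / "clips" / row["path"], out / "clips")
    with open(out / "validated.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

The point of the `variant` column is exactly the provenance requirement raised below: the merged download stays a single dataset, but nothing stops a downstream user from splitting it back apart.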

I think these points are worth discussing. I generally agree with Elizabeth’s post, but I would like to point out that, done this way, all her points would continue to be respected (because the dataset would remain unified) while our “safety concerns” are also satisfied, and the annotated Armenian dataset could potentially become super useful for Armenian studies in general.

Regarding the dialects: well, as a hyw speaker who was never around hye speakers, I don’t understand hye speakers when they speak around me (10% comprehension), or maybe 30-50% when they speak to me. The difference between hye and hyw is greater than US vs UK English, perhaps comparable to the distance between Arabic dialects. That’s part of the motivation for hye and hyw having separate ISO codes. Now, if I have corpus A in hye and corpus B in hyw, I see no problem in merging them, as long as the original information (of dialectal origin) is maintained somewhere so that it isn’t lost.