I have read Elizabeth’s post and it makes sense: a good ASR system can indeed be trained on both corpora. Neural networks are fascinating systems that can usually figure out these kinds of issues by themselves, but I still think it’s important to have a mandatory variant classification implemented on Common Voice, for these reasons:
- If it later turns out that a specific Armenian model is better built on a single variant, it will be easy to split the dataset: for example, someone may want to build just a hyw-optimized system, or future research may show that separate training works better, etc.;
- If someone wants to use the dataset for other studies;
- To make a TTS system that can be set to match the desired pronunciation;
- To enable automatic recognition of the variant used;
- In general, when building a dataset it is always fundamental to gather metadata, and this is a very important piece of data: we are not facing “just” a dialect (as with Italian from different regions) but phonological, syntactic and morphological differences. AFAIK the situation is rather different from that of other languages, so much so that even Wikipedia distinguishes hye from hyw.
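To illustrate the first point, here is a minimal sketch of how a variant column would make splitting trivial. The column name `variant`, the column order, and the file layout are assumptions for illustration, not the real Common Voice schema; the mock TSV is created inline so the example is self-contained.

```shell
# Tiny mock of a Common Voice-style validated.tsv (tab-separated),
# with a hypothetical "variant" column holding hye or hyw:
cat > validated.tsv <<'EOF'
client_id	path	sentence	variant
a1	clip1.mp3	text one	hye
a2	clip2.mp3	text two	hyw
a3	clip3.mp3	text three	hye
EOF

# Keep the header (NR==1) plus every row whose variant column matches;
# $4 is the "variant" column in this mock layout:
awk -F'\t' 'NR==1 || $4=="hye"' validated.tsv > validated-hye.tsv
awk -F'\t' 'NR==1 || $4=="hyw"' validated.tsv > validated-hyw.tsv
```

With the annotation in place, anyone downloading the full dataset can derive a single-variant subset in one line; without it, the same split would require re-listening to every clip.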
So it would be good to fix the current TSV file. Right now there are only around 60 voices in the dataset, so it’s not too late to determine whether each one is Eastern or Western and then update the TSV file directly on Mozilla’s servers, to avoid creating confusion. I know it’s a hard decision that has to be discussed thoroughly, but it’s important at least to open the discussion.
To avoid this confusion in the future, my proposal is that the system should ask each Armenian user whether they are going to speak with Eastern or Western pronunciation, and this field should be mandatory so that all registered users are annotated (in theory it could be done even for non-registered users, by prompting just before the recordings are sent, but I know that would require some slight modification to the code).
At least in this way a large part of the issue can be solved, and even if none of the scenarios I listed ever occurs (which can and probably will happen), we still bring the chances of the dataset becoming useless closer to zero. And this is super important, not just to avoid wasting the work of CV staff and volunteers, but also because CV is currently the only public-domain Armenian dataset: a failure of Armenian Common Voice would mean a failure of Armenian open-source ASR and TTS for many years, and a loss for an entire community.
The issue of differences in sentences still remains open. Maybe it’s a minor issue compared with pronunciation (or maybe not; we need an answer from Armenians). This would be harder to solve with tweaks to the code, and probably no one will actually make those modifications on GitHub because it’s harder work. But in this case, wouldn’t it be better to keep the languages separate, place them close together in the list, and then merge them when the dataset is built? This way Elizabeth’s very correct observation, that hyw could be overlooked or that developers could abandon the project amid the confusion, would be addressed, because the downloaded dataset would contain just one set of TSV files and one folder of MP3s, with the merging done server-side. And it’s super easy to code: just a simple script based on cat and cp.
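The server-side merge could indeed be sketched with little more than cat and cp. The directory names, TSV columns and clip names below are made up for the example (the real release layout may differ), and tiny mock inputs are created inline so the script runs on its own:

```shell
# Hypothetical per-variant release folders and a target for the merge:
mkdir -p hye/clips hyw/clips merged/clips

# Mock per-variant TSVs and clips (illustrative schema, not the real one):
printf 'client_id\tpath\tsentence\n' >  hye/validated.tsv
printf 'e1\teast.mp3\teastern text\n' >> hye/validated.tsv
printf 'client_id\tpath\tsentence\n' >  hyw/validated.tsv
printf 'w1\twest.mp3\twestern text\n' >> hyw/validated.tsv
touch hye/clips/east.mp3 hyw/clips/west.mp3

# Merge the TSVs: write the header once, then append the data rows
# of both variants (tail -n +2 skips each file's header line):
head -n 1  hye/validated.tsv  > merged/validated.tsv
tail -n +2 hye/validated.tsv >> merged/validated.tsv
tail -n +2 hyw/validated.tsv >> merged/validated.tsv

# Gather all clips into a single folder:
cp hye/clips/*.mp3 hyw/clips/*.mp3 merged/clips/
```

The only real caveat is clip file-name collisions between the two folders; Common Voice clip names are effectively unique hashes, so in practice a plain cp should be safe.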
I think these points are worth discussing. I generally agree with Elizabeth’s post, but I would like to point out that in this way all her points continue to be respected (because the dataset remains unified), while our “safety concerns” are also satisfied, and the annotated Armenian dataset could become a super useful resource for Armenian studies in general.