Dialect metadata in the Armenian dataset

Barev,
Starting from problem (2): from what you are saying, it seems that even if someone managed to move the sentences pronounced the “hyw way” from the current hye dataset to the new hyw dataset, it would still be problematic, because the pronunciation would be more or less correct but the sentences themselves would not be (if I understood your points correctly; it’s night here :wink: ). I think (but we need someone who has worked with DeepSpeech in this conversation) that having some percentage of hyw-pronounced sentences in the hye dataset could even help build a broader model that recognizes more situations (e.g. a hyw speaker speaking hye with an accent, which I think can easily happen).
Then the hyw project could be started from scratch, provided 5000 sentences are collected and volunteers are found. If you are really an angel, you could mark which sentences can be moved (same syntax) and which are better left where they are (hye syntax), but I understand that is a lot of work.

On the other hand, annotating users who have hyw pronunciation could be useful. To do that you could use the 8th column of the tsv file, “accent”. To modify it you could either write your own script, if you are comfortable with text manipulation, or use this software. I found it today here on Discourse; I tried to install it to see whether it can sort rows by client_id, but it would not install on my system.
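
If you go the script route, something like this rough Python sketch might be a starting point. It is only an illustration under a few assumptions of mine: that the tsv has a header row with columns named `client_id` and `accent`, that the label to write is `hyw` (I made that value up, the real one should be agreed on with whoever maintains the dataset), and that you already have a list of the client_ids that speak with hyw pronunciation. The function name `tag_accent` is hypothetical too.

```python
def tag_accent(lines, client_ids, label="hyw"):
    """Set the 'accent' field to `label` for rows whose client_id is in `client_ids`.

    `lines` is an iterable of tab-separated lines, header first.
    Returns the new lines: header first, then data rows sorted by client_id.
    The column names and the default label are assumptions, not confirmed values.
    """
    lines = [line.rstrip("\r\n") for line in lines]
    header = lines[0].split("\t")
    # Look the columns up by name instead of hard-coding position 8.
    id_col = header.index("client_id")
    accent_col = header.index("accent")

    rows = []
    for line in lines[1:]:
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        fields = line.split("\t")
        if fields[id_col] in client_ids:
            fields[accent_col] = label
        rows.append(fields)

    # Group each speaker's clips together, which makes manual review easier.
    rows.sort(key=lambda fields: fields[id_col])
    return ["\t".join(header)] + ["\t".join(fields) for fields in rows]
```

To use it on a real file you would read the tsv with `open(path, encoding="utf-8")`, pass the file object and a set of client_ids to `tag_accent`, and write the returned lines to a new file joined with newlines. I split on tabs by hand instead of using the `csv` module so that quotation marks inside sentences are left untouched.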

In any case, this conversation should be seen by someone who knows the repositories better, because the tsv files would then have to be changed somehow, and that has to be approved and carried out by the moderators. I still haven’t studied how the community works here, but I imagine there are some people in charge of this.

Semi-OT: are the Armenian sentences good? A friend of mine who is contributing said they were a bit strange and full of very uncommon words. Maybe she just got some sentences taken from very specialized Wikipedia pages.