Changes to some letters in the alphabet

lachin.kuz · June 12, 2021, 6:35pm

Hi,
We are about to start recording sentences for Uzbek. However, at some point that is not yet known, some letters in our alphabet will change, for example oʻ to ō or ö.

The new alphabet is still under consideration, with several candidate forms for some letters, and we are not going to wait. So if we upload sentences with the current letters, will there be a way later to edit reviewed sentences that already have clips?

ftyers · June 12, 2021, 6:41pm

Dear @lachin.kuz, thanks for your question. This is something that we have been discussing recently. There is currently no mechanism for implementing orthographic reforms in Common Voice, so you will need to do the pre-/post-processing outside of the system.

Do know, however, that we are aware of the issue and are considering how to address it.
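
To make "pre-/post-processing outside of the system" concrete, here is a minimal sketch of converting exported sentences to a new orthography. The replacement rules are placeholders based on the oʻ to ō example in this thread (the final alphabet is undecided), and `RULES` and `convert` are hypothetical names, not part of Common Voice.

```python
# Placeholder rules: the thread only mentions o + U+02BB becoming ō (or ö)
# as a possibility. Replace these once the new alphabet is official.
RULES = {
    "O\u02bb": "\u014c",  # Oʻ -> Ō
    "o\u02bb": "\u014d",  # oʻ -> ō
}


def convert(text, rules=RULES):
    """Apply each old -> new replacement to one exported sentence."""
    for old, new in rules.items():
        text = text.replace(old, new)
    return text


print(convert("O\u02bbzbek tili"))
```

Keeping the rules in one table means they can be swapped out if a different variant of the letter is chosen.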

lachin.kuz · June 12, 2021, 7:11pm

Thanks for your fast reply, Francis.

So, to avoid confusion, we will have to stick with the alphabet we start with even if the new letters are released, is that right?

I hope to see a solution within CV soon.

mkohler · June 12, 2021, 8:41pm

Just a side note: from a Sentence Collector perspective, we can run migrations on the sentences in the database for whatever changes we’d like. Of course, doing that alone would have negative effects on the CV website and the dataset releases. Still, I think it’s good to know that on the Sentence Collector side we could do whatever is needed to stay consistent with CV.

ftyers · June 12, 2021, 9:23pm

I would not mix and match sentences yet. I would stick with the original orthography until we come up with a solution. It would be good if you could file an issue on GitHub about this. That would be a good place to track it.

ftyers · June 12, 2021, 9:25pm

The problem (as I understand it) is that the key for each sentence is the hash of the original sentence. So if we change the sentence, we change its hash, and all clips that were linked to that sentence are no longer linked. Probably the easiest option is to have a table that maps each original sentence → its current sentence, to track the updated orthographic forms.

Note that this is (marginally) relevant for German too with the ß/ss thing (and whatever they may come up with next).
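
A rough sketch of the mapping-table idea above, to show why it keeps clips linked. This is only an illustration under assumptions: SHA-256 over the UTF-8 text stands in for whatever hash is actually used, and `sentence_key`, `clips`, `current_form`, and the clip file name are all made up for the example.

```python
import hashlib


def sentence_key(text):
    # Assumption: the key is a SHA-256 hex digest of the UTF-8 sentence text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "O\u02bbzbek tili"  # sentence in the current orthography
updated = "\u014czbek tili"    # hypothetical reformed spelling

# Clips stay linked to the key of the sentence as it was originally recorded.
clips = {"clip_0001.mp3": sentence_key(original)}

# Mapping table: original sentence key -> current orthographic form.
current_form = {sentence_key(original): original}


def apply_reform(original_text, new_text):
    """Update the current form without changing the key that clips point to."""
    current_form[sentence_key(original_text)] = new_text


def text_for_clip(clip_name):
    """Look up the current form of the sentence a clip was recorded for."""
    return current_form[clips[clip_name]]


apply_reform(original, updated)
print(text_for_clip("clip_0001.mp3"))
```

Rehashing the updated sentence would give a different key, which is exactly what breaks the link; here the key never changes and only the value in the table does.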