Every sentence with the characters лј should be changed to use љ in the Serbian dataset

Fooftilly · May 25, 2021, 3:57pm

There is a problem with the conversion from Serbian Latin to Serbian Cyrillic script. Whoever submitted those sentences probably overlooked it when converting them. Not many words contain this character, so it isn’t that big of a problem. Even so, it’s better to fix it and put those sentences back into review, because 70% of the sentences I marked as incorrect in the Serbian dataset were marked incorrect only because of this small error.
Serbian Cyrillic has a single letter љ, which is written as the digraph lj in Latin script. During the conversion to Cyrillic, lj was treated as two separate letters (л and ј) instead of one, and that is where the problem arises. The solution is to change every лј to љ. This shouldn’t cause any problems, because those two letters never occur together in Cyrillic; the sequence only appears in Latin script, where lj counts as a single letter rather than two.
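
To make the proposed fix concrete, here is a minimal Python sketch of the replacement. It is not the actual import script; the name `fix_digraphs` and the mapping table are hypothetical. It also covers нј → њ, and it assumes, as argued above, that these letter pairs never legitimately occur next to each other in the Cyrillic text, which would be worth verifying before running it on the whole dataset.

```python
# Hypothetical helper, not the import script used for the dataset.
# Maps wrongly split letter pairs to the single Cyrillic letter.
# Assumption: л+ј and н+ј never legitimately appear side by side.
DIGRAPH_FIXES = {
    "лј": "љ", "Лј": "Љ", "ЛЈ": "Љ",
    "нј": "њ", "Нј": "Њ", "НЈ": "Њ",
}


def fix_digraphs(sentence: str) -> str:
    """Replace every wrongly split pair with its single-letter form."""
    for wrong, right in DIGRAPH_FIXES.items():
        sentence = sentence.replace(wrong, right)
    return sentence


print(fix_digraphs("Лјуди чекају."))  # Људи чекају.
```

Plain string replacement is enough here because each pair maps to exactly one letter, so no regular expressions or context rules are needed.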

ftyers · May 26, 2021, 5:33pm

Yeah, I already fixed it in the script. I noticed it during the first import, but by then the sentences had already been imported. It’s a simple fix. So do you want me to go over the rejected sentences, fix them, and reimport them? Note that the same problem occurred with Нј. I fixed that as well.

ftyers · May 26, 2021, 5:35pm

There only appear to be four examples in the rejected sentences:

$ cat /tmp/foo | grep '[ЛН]ј'
    Нјега лече антибиотицима и витаминима.
    Лјуди чекају.
    Лјуди се хапсе.
    Нјен циљ?
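
Note that the pattern above only matches a capital Л or Н, so it would miss a split pair in the middle of a word. As a rough cross-check, the same search can be sketched in Python with both cases included. The name `find_suspect` and the sample list are hypothetical, and the last sample sentence is made up to show a lowercase match; it is not from the dataset.

```python
import re

# Cyrillic л or н (either case) followed by Cyrillic ј (either case).
SUSPECT = re.compile("[ЛлНн][Јј]")


def find_suspect(sentences):
    """Return the sentences that contain a wrongly split лј or нј pair."""
    return [s for s in sentences if SUSPECT.search(s)]


# Sample input: two sentences from the grep output, one already correct
# sentence, and one made-up sentence with a lowercase pair.
sample = ["Лјуди чекају.", "Нјен циљ?", "Људи се хапсе.", "Волим полје."]
for sentence in find_suspect(sample):
    print(sentence)
```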

Fooftilly · May 26, 2021, 7:13pm

Thank you. I thought there were more of those.