I'm a bit late to this thread, but hopefully you're still reading feedback.
My concern for you is about languages.
You have a definition for what you consider to be a language, but in the end, the definition alone may not help you. When people start requesting less common languages, it will become difficult to know whether they're requesting a valid language.
(1) Sometimes it's very hard to decide whether something is just a "variant" rather than a whole language of its own. Unless you have a team of linguists working on categorizing languages, you probably won't be able to decide properly.
(2) You may also face difficulties when accepting or rejecting constructed languages. You already have Esperanto, and that one is easy to accept. But if someone requested Toki Pona, would you accept it or reject it?
Because of those two problems, at Tatoeba we ended up following the ISO 639-3 categorization. This helps us decide what is a language, what is a dialect/variant, and which constructed languages are "officially" recognized as languages.
I can give you a concrete example of a difficult decision with Arabic.
- Based on the ISO 639-3 categorization, our contributors are allowed to request each of the languages listed under the Arabic macrolanguage.
- We had a request to add "Gulf Arabic" as a language: https://github.com/Tatoeba/tatoeba2/issues/1084
- Some people disagreed with this and argued that there is only one Arabic language and that adding Gulf Arabic will make the Tatoeba corpus messy.
- Someone counter-argued that there is linguistic evidence for separating Arabic into that many languages.
- Gulf Arabic is valid based on the ISO 639-3 language list, and we added it.
More recently we had issues with Berber and Kabyle. I won't go into details for this one; this is just to say that Arabic was not the only difficult case.
Based on my experience, I would recommend that you check what's available out there and adopt a predefined list of languages, so that you can tell your contributors "These are the languages we acknowledge as languages" and don't have to worry too much about drawing the line between languages, dialects, and variants. It doesn't have to be ISO 639; there may be something else more suitable for you.
Even a predefined list won't spare you from lots of headaches, but at least it will give you a direction for deciding which languages you can accept.
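To make the "predefined list" idea a bit more concrete, here is a rough Python sketch of how a request could be checked against such a list. This is only an illustration and not Tatoeba's actual code: the `ISO_639_3_EXCERPT` table and the `review_language_request` function are names I made up, and the table holds just a handful of ISO 639-3 codes, where a real setup would load the full list from whichever standard you pick.

```python
# Hypothetical excerpt of an ISO 639-3 code table (assumption: a real
# deployment would load the complete list from the chosen standard).
ISO_639_3_EXCERPT = {
    "epo": "Esperanto",
    "afb": "Gulf Arabic",
    "arb": "Standard Arabic",
    "kab": "Kabyle",
}


def review_language_request(code, known_languages=ISO_639_3_EXCERPT):
    """Return (accepted, message) for a requested three-letter code."""
    normalized = code.strip().lower()
    name = known_languages.get(normalized)
    if name is None:
        # Not in the reference list: no linguistic debate, just a rejection.
        return False, f"'{normalized}' is not in the reference list; request rejected."
    return True, f"'{normalized}' ({name}) is in the reference list; request accepted."


print(review_language_request("afb"))
print(review_language_request("my-dialect"))
```

The point of doing it this way is that the decision comes down to a lookup: the argument about whether something is a language or a dialect is delegated to whoever maintains the list.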
Also, if it helps, here's a snippet of Tatoeba's instructions for language requests so that you have a concrete idea of how we handle this:
Search for your language in the ISO 639-3 list of languages.
[…]
Please understand that if your language is not recognized in the ISO 639-3 standard, we cannot support it in Tatoeba. Language classification is a complex task and it is not part of Tatoeba's mission. We rely on the ISO 639-3 standard to define what is a valid language.
There are some exceptions due to legacy reasons, but we will not make more exceptions.
If your language is missing in this standard, please contact the ISO 639-3 Registration Authority from their website: https://iso639-3.sil.org/.