I’m a bit late to this thread, but hopefully you’re still reading feedback.
My concern for you is about languages.
You have a definition for what you consider to be a language, but in the end, the definition alone may not help you. When people start requesting less common languages, it will become difficult to know whether they’re requesting a valid language.
(1) Sometimes it’s very hard to decide whether something is just a “variant” or a whole language on its own. Unless you have a team of linguists working on categorizing languages, you probably won’t be able to decide properly.
(2) You may also face difficulties when accepting or rejecting constructed languages. You already have Esperanto, and that one is easy to accept. But if someone requested Toki Pona, would you accept it or reject it?
Based on those two problems, in Tatoeba we ended up following the ISO 639-3 categorization. This helps us to decide what is a language, what is a dialect/variant, and which constructed languages are “officially” recognized as languages.
I can give you a concrete example of a difficult decision with Arabic.
- Based on the ISO 639-3 categorization, our contributors are allowed to request each of the languages listed under the Arabic macrolanguage.
- We had a request to add “Gulf Arabic” as a language: https://github.com/Tatoeba/tatoeba2/issues/1084
- Some people disagreed with this and argued that there is only one Arabic language and that adding Gulf Arabic will make the Tatoeba corpus messy.
- Someone counter-argued that there is linguistic evidence for separating Arabic into those many languages.
- Gulf Arabic is valid based on the ISO 639-3 language list, and we added it.
More recently we had issues with Berber and Kabyle. I won’t go into details for this one; I mention it only to say that Arabic was not the only difficult case.
Based on my experience, I would recommend that you check what’s available out there and adopt a predefined list of languages, so that you can tell your contributors “These are the languages we acknowledge as languages” and don’t have to worry too much about drawing the line between languages, dialects, and variants. It doesn’t have to be ISO 639-3; there may be something else more suitable for you.
A predefined list won’t spare you from every headache, but at least it will give you a direction for deciding which languages you can accept.
Also, if it helps, here’s a snippet of Tatoeba’s instructions for language requests, so that you have a concrete idea of how we handle this:
Search for your language in the ISO 639-3 list of languages.
Please understand that if your language is not recognized in the ISO 639-3 standard, we cannot support it in Tatoeba. Language classification is a complex task and it is not part of Tatoeba’s mission. We rely on the ISO 639-3 standard to define what is a valid language.
There are some exceptions due to legacy reasons, but we will not make more exceptions.
If your language is missing in this standard, please contact the ISO 639-3 Registration Authority from their website: https://iso639-3.sil.org/.
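To make the "predefined list" idea concrete, here is a minimal Python sketch of how a language request could be checked against such a list. This is not Tatoeba's actual code: the names `ACCEPTED_LANGUAGES` and `review_language_request` are hypothetical, and the dictionary holds only a tiny hand-picked subset of ISO 639-3 codes for illustration. A real setup would presumably load the full list from whatever standard you choose.

```python
# Hypothetical sketch: a tiny, hand-picked subset of ISO 639-3 codes.
# A real project would load the complete list from its chosen standard.
ACCEPTED_LANGUAGES = {
    "epo": "Esperanto",
    "arb": "Standard Arabic",
    "afb": "Gulf Arabic",
    "kab": "Kabyle",
}


def review_language_request(code: str) -> str:
    """Accept a request only if its code appears in the predefined list."""
    normalized = code.strip().lower()
    name = ACCEPTED_LANGUAGES.get(normalized)
    if name is None:
        # No linguistic judgment is made here; the list is the only authority.
        return f"rejected: '{normalized}' is not in our predefined list"
    return f"accepted: {name} ({normalized})"


if __name__ == "__main__":
    print(review_language_request("afb"))
    print(review_language_request("xyz"))
```

The point of the design is that the check never tries to decide what counts as a language. It only looks the code up, so any disagreement is redirected to whoever maintains the list.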