I take note of the point about listing languages, but there is probably a fair amount of work involved for us in analyzing all languages in advance rather than analyzing them as they are requested.
You wouldn’t have to analyze all the languages in advance. The purpose is more that you wouldn’t have to take the responsibility of deciding how to organize this very complex thing that is human languages. You would delegate it to another entity, so that you can put more effort into designing and developing your tools.
By relying on an existing list of languages, you are more or less choosing a framework. There have been several attempts to define what a language is, and you don’t have to start this work from scratch.
You can still have your own definition of what a language is, one tailored specifically for DeepSpeech, but that definition would have to remain internal to Common Voice and DeepSpeech. It’s not a definition you can easily impose on the rest of the world.
When you say “our technical needs to train DeepSpeech models require us to have a more restrictive definition of what a language is, one that will differ from other definitions out there”, you have to be a little bit careful. With this approach you are asking users/contributors to adapt to your needs. You’ll be asking them to understand what a language is for you, rather than trying to understand what a language is for them.
I know how difficult it is to build software and I perfectly understand the rationale behind your approach. But I can tell you with a lot of confidence that the concept of language carries more than just a common set of words, grammar and writing system.
Language is, for a lot of people, something very tightly connected to their identity. It’s a facet of their culture, their history, their people. If you categorize their language in a different way than they perceive it, they won’t be happy with it, or they will be confused by it.
If you want to be as inclusive as possible, if you want to cater to diversity, you have to set technical requirements aside.
Iveskins mentioned Serbian and Croatian, and it’s an interesting example. Based on your definition of a language, you might put Serbian and Croatian – and Bosnian – under one and the same language, just like you would put American and British English under the same language. Then you might sub-categorize this language into a Croatian accent, a Serbian accent, a Bosnian accent. But, if I may quote a Serbian who once wrote to us on this topic, “You will probably cause civil unrest if you would publicly put it as one language, from pure political reasons”.
Concretely, if you were to add Serbian, Croatian and Bosnian to your supported languages, you’d probably prefer to present them as different languages in the user interface. But under the hood you could remap the data into one “technical language” with different accents, if that’s more useful for DeepSpeech. The way you organize the data before feeding it to DeepSpeech is, after all, your own business. But you cannot tell your contributors “We grouped these languages into one because it makes more sense for DeepSpeech” – that’s not something they will all appreciate.
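To make the idea concrete, here is a minimal sketch of that remapping step: the user interface keeps Serbian, Croatian and Bosnian as separate languages, while the training pipeline quietly regroups them into one internal “technical language” with an accent tag. The names here (`INTERNAL_LANGUAGE`, `remap_for_training`, the `"hbs"` internal label) are hypothetical illustrations, not actual Common Voice or DeepSpeech APIs.

```python
# Hypothetical mapping from UI-facing language codes to an
# (internal language, accent) pair used only by the trainer.
INTERNAL_LANGUAGE = {
    "sr": ("hbs", "serbian"),    # "hbs" is a placeholder internal label
    "hr": ("hbs", "croatian"),
    "bs": ("hbs", "bosnian"),
    "en-US": ("en", "american"),
    "en-GB": ("en", "british"),
}

def remap_for_training(clip):
    """Rewrite a clip's metadata before feeding it to the trainer,
    without changing what contributors saw in the UI.
    Unmapped languages pass through unchanged."""
    lang, accent = INTERNAL_LANGUAGE.get(
        clip["language"], (clip["language"], clip.get("accent", ""))
    )
    return {**clip, "language": lang, "accent": accent}

clip = {"path": "clip_0001.wav", "language": "hr", "accent": ""}
print(remap_for_training(clip))
# {'path': 'clip_0001.wav', 'language': 'hbs', 'accent': 'croatian'}
```

The point of the indirection is exactly what the paragraph above describes: contributors only ever see the UI-facing codes, and the grouping decision stays an internal implementation detail that can change without any public announcement.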
It is not a bad thing if you prefer to handle your language requests case by case and build your own list of languages along the way. I don’t want to discourage you from it. Just be aware that you’d be doing linguistic work (and difficult work at that). If that’s the path you choose, you probably want to involve linguists as early as this stage, where you’re trying to define what a language is.
I don’t want to make you over-worried about it, though. I’m pretty sure you can carry on with an intuitive and technically oriented definition of language. Many people will still be very enthusiastic to donate their voice regardless of how you define what a language is. They will be understanding and they will comply with your definition. But involving linguists now can save you from awkward situations later, or at the very least leave you better prepared for those awkward situations.