Thanks for the feedback @gadda and welcome to the community!
Can you provide a few languages as examples where this happens?
Thanks!
I see feedback is closing soon. I have time for a few thoughts about writing systems. (There's much to say about "the same words and grammar", too.)
Why include a writing system in the definition of a language? In line with @gadda, many languages do not have one writing system. Reasons for this can include:
Examples of 4: Mongolian, Serbian. And of the last: ASL.
Moreover, scripts have variants. Do "Simplified" and "Traditional" Chinese characters represent different languages? What about various scripts for writing Assyrian Aramaic? Or, looking back in history, what about Egyptian, or the Tamil Brahmic system variants layered onto the language? I don't know which cases matter to these projects, but they're relevant linguistically.
I'm a bit late to this thread; hopefully you're still reading feedback.
My concern for you is about languages.
You have a definition for what you consider to be a language, but in the end, the definition alone may not help you. When people start to request languages that are not so common, it will become difficult to know if they're requesting a valid language.
(1) Sometimes it's very hard to decide if something is just a "variant" rather than a whole language on its own. Unless you have a team of linguists working on categorizing languages, you probably won't be able to decide properly.
(2) You may also face difficulties when accepting or rejecting constructed languages. You already have Esperanto; that one is easy to accept. But if someone requested Toki Pona, would you accept it or reject it?
Based on those two problems, in Tatoeba we ended up following the ISO 639-3 categorization. This helps us to decide what is a language, what is a dialect/variant, and which constructed languages are "officially" recognized as languages.
I can give you a concrete example of a difficult decision with Arabic.
More recently we had issues with Berber and Kabyle. I won't go into details for this one; this is just to say that Arabic was not the only difficult case.
Based on my experience, I would recommend that you check what's available out there and get a predefined list of languages, so that you can tell your contributors "These are the languages we acknowledge as languages" and don't have to worry too much about drawing the line between languages, dialects, and variants. It doesn't have to be ISO 639; there's maybe something else more suitable for you.
Even with a predefined list, it won't spare you from lots of headaches, but at least it will give you a direction for deciding which languages you can accept.
Also, if it helps, here's a snippet of Tatoeba's instructions for language requests, so that you have a concrete idea of how we handle this:
Search for your language in the ISO 639-3 list of languages.
[âŠ]
Please understand that if your language is not recognized in the ISO 639-3 standard, we cannot support it in Tatoeba. Language classification is a complex task and it is not part of Tatoeba's mission. We rely on the ISO 639-3 standard to define what is a valid language.
There are some exceptions due to legacy reasons, but we will not make more exceptions.
If your language is missing in this standard, please contact the ISO 639-3 Registration Authority from their website: https://iso639-3.sil.org/.
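For a project that wants to automate this kind of check, the gist of Tatoeba's policy boils down to a lookup against the ISO 639-3 code table. Here is a minimal hypothetical sketch; the hard-coded table is a tiny illustrative subset, and a real implementation would load the full code table published by the ISO 639-3 Registration Authority.

```python
# Hypothetical sketch: validate a language request against ISO 639-3.
# The dict below is a tiny illustrative subset; the full code table is
# published by the ISO 639-3 Registration Authority (iso639-3.sil.org).
ISO_639_3 = {
    "eng": "English",
    "epo": "Esperanto",        # constructed, but ISO-recognized
    "kab": "Kabyle",
    "arb": "Standard Arabic",
    "eus": "Basque",
}

def validate_language_request(code: str) -> str:
    """Return a decision message for a requested language code."""
    name = ISO_639_3.get(code.lower())
    if name is None:
        return (f"'{code}' is not in our ISO 639-3 table; please contact "
                "the ISO 639-3 Registration Authority.")
    return f"'{code}' ({name}) is a valid language for this project."

print(validate_language_request("epo"))
print(validate_language_request("xx"))
```

The point of delegating to the standard is exactly what the policy above says: the hard classification work happens elsewhere, and the project only performs a lookup.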
My first language is New Zealand English. It has a common writing system with most other Englishes but also has some unique spelling/sound correspondences. These are due to intermixing with NZ's other common language, Māori.
There are many common words borrowed from Māori in New Zealand English. These words are written with the Māori spelling/sound correspondences. E.g. "Whakatane" is said fah-kah-TAH-nə, not wack-a-tain. Ngaruawahia = [ŋaːɾʉaːwaːhia], with the NG being a nasal sound like in "suNG"…
Macrons can sometimes also be used for long vowels in Māori. Macrons are kept when writing in English.
To complicate things further… there is a wide range of real-life ways that non-native and uneducated speakers of Māori or NZ English will pronounce such words. If you showed the prompt "navigate to Whakatane" to someone with no experience with New Zealand English, it would be hard for them to know how to pronounce it.
So both "correct" and incorrect pronunciations are in common use. If someone wanted to make a voice-controlled GPS app that could be used by international tourists and locals alike, then it would be important to capture all these data points.
Google and Vodafone NZ have made a presumably private dataset described here.
https://news.vodafone.co.nz/article/new-zealanders-highlight-te-reo-maori-names-be-updated-google-maps
If you are making an app for transcription of text and it is being used by someone outside of NZ, you probably don't want "Fah-Kah" to correspond to the letters "Whaka". So this dataset needs to be separable from other Englishes.
So there is a plurality of Englishes around the world which share many but not all words. If each language has just one dataset, will the unique features of each country be left out, or all mixed together? Neither seems desirable. Or will there be a great duplication of words where there is overlap? Also not the best.
Say New Zealand English and Australian English are 95% similar in terms of words and grammar. New Zealand and British are 90% alike, and New Zealand and American are also 90% alike but in different ways. Do we need to collect completely different sentence sets for NZ, AUS, UK, US?
If not, what does an American do with the prompt "Rangitoto", or a NZer with "Arkansas" for that matter?
When I talk with English speakers from Kenya or India, they have their own unique sets of words and grammar too. These cannot simply be accounted for as accents or informal language.
This study has some good examples of the differences
http://archive.gameswithwords.org/WhichEnglish … It might need the Wayback Machine to read now. "The dog was chased the cat." Etc.
(Also, does grammar even matter to an agent trained on sound files / text chunks?)
My current second language is Japanese. I am forever embarrassed by Amazon Alexa's refusal to understand a word I say, even when humans have no problem. I wouldn't consider myself near native, but I definitely think my accent is influenced by the region I live in. This is the language community I participate in to become a Japanese speaker, so of course I pick up its habits.
As for accents: I don't think defining by cities is a good idea. Firstly, because about half the world doesn't live in one (yet). Rural people are already underserved by technology; I would hesitate to choose categories that by design make something less useful to them.
Secondly, because accent is more about language communities, maybe… There are also differences based on age, class, education level, and ethnicity. Common descriptions of English accents usually include a Cultivated variant, because people like to show how educated they are by changing up their vowels.
Accent is of course related to how different speakers move from graphemes (signs) to morphemes (mental) to phonemes (sounds); this is segmental. There are also suprasegmental elements to accents: stress, intonation, prosody, pitch. These seem to be missing from this definition.
Sorry if there are mistakes here; I'm no expert. Also, it's hard to write on a phone.
https://i.imgur.com/56VgVmP.jpg Indeed. Which is Serbian, which is Croatian… I don't want to be the one to say…
For our practical purposes, yes. Bear in mind our app is displaying text for people to read, and that text is then matched with voices and passed to a machine learning system.
We need to organize different writing systems into different datasets; that's why we consider languages based on a common writing system.
When in doubt, we will request the help of a linguistic expert to decide.
Please note that the Common Voice project's needs might not be the same as others'. As I commented previously, our technical needs to train DeepSpeech models require us to have a more restrictive definition of what a language is, which will differ from other definitions out there.
I take note about listing languages, but there is probably a fair amount of work for us to analyze all languages in advance rather than analyzing them as they are requested.
Note that while English is one dataset, the accent metadata would allow us to create sub-datasets based on accent, so it would be possible to create a dataset for English in New Zealand if we have enough voices from there. Also note that the first priority is to be able to understand English; the possibility to adapt to local accents is an extra thing we will be able to do, but first we need to achieve that first goal.
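The accent-metadata idea amounts to filtering one language dataset into smaller sub-datasets. A minimal sketch, assuming each clip record carries `language` and `accent` fields (the field names and values here are invented for illustration, not Common Voice's actual schema):

```python
# Hypothetical sketch: derive an accent-specific sub-dataset from a single
# language dataset using per-clip metadata. Field names are illustrative.
clips = [
    {"path": "a.wav", "language": "en", "accent": "newzealand"},
    {"path": "b.wav", "language": "en", "accent": "us"},
    {"path": "c.wav", "language": "en", "accent": "newzealand"},
]

def sub_dataset(clips, language, accent=None):
    """Filter clips by language, and optionally by accent."""
    return [c for c in clips
            if c["language"] == language
            and (accent is None or c["accent"] == accent)]

nz_english = sub_dataset(clips, "en", accent="newzealand")
print(len(nz_english))  # 2
```

With `accent=None` the full English dataset is returned, matching the "first priority is to understand English" approach, while the narrower filter only becomes useful once enough voices from a region exist.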
People wonât have to select a city if they donât want to. They will be always able to select region or country.
I take note about listing languages, but there is probably a fair amount of work for us to analyze all languages in advance rather than analyzing them as they are requested.
You wouldn't have to analyze all the languages in advance. The purpose is more that you wouldn't have to take the responsibility of deciding how to organize this very complex thing that is human languages. You would delegate it to another entity, so that you can put more effort into designing and developing your tools.
By relying on an existing list of languages, you are more or less choosing a framework. There have been several attempts to define what a language is, and you don't have to start this work from scratch.
You can still have your own definition of what a language is, one that is tailored specifically for DeepSpeech, but that definition would have to remain internal to Common Voice and DeepSpeech. It's not a definition you can easily impose on the rest of the world.
When you say "our technical needs to train DeepSpeech models require us to have a more restrictive definition of what a language is, which will differ from other definitions out there", you have to be a little bit careful. With this approach you are asking users/contributors to adapt to your needs. You'll be asking them to understand what a language is for you, rather than trying to understand what a language is for them.
I know how difficult it is to build software and I perfectly understand the rationale behind your approach. But I can tell you with a lot of confidence that the concept of language carries more than just a common set of words, grammar and writing system.
Language is, for a lot of people, something very tightly connected to their identity. It's a facet of their culture, their history, their people. If you categorize their language in a different way than they perceive it, they won't be happy with it or they will be confused about it.
If you want to be as inclusive as possible, if you want to cater to diversity, you have to forget about technical requirements.
Iveskins mentioned Serbian and Croatian, and it's an interesting example. Based on your definition of a language, you might put Serbian and Croatian (and Bosnian) under the same language, just like you would put American and British English under the same language. Then you might sub-categorize this language into a Croatian accent, a Serbian accent, a Bosnian accent. But, if I may quote a Serbian who once wrote to us on this topic, "You will probably cause civil unrest if you would publicly put it as one language, from pure political reasons".
Concretely, if you were to add Serbian, Croatian and Bosnian to your supported languages, you'd probably prefer to present them as different languages in the user interface. But under the hood you could remap the data into one "technical language" with different accents, if that's more useful for DeepSpeech. The way you organize the data before feeding it to DeepSpeech is, after all, your own business. But you cannot tell your contributors "We grouped these languages into one because it makes more sense for DeepSpeech"; that's not something they will all appreciate.
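The "present separately in the UI, remap under the hood" idea can be sketched in a few lines. This is purely illustrative: the UI codes, the internal grouping, and the accent tags are assumptions for the sake of the example, not a recommendation for how to classify these languages.

```python
# Hypothetical sketch: languages shown separately in the UI are remapped
# to one internal "technical language" plus an accent tag before training.
# The grouping below is illustrative only.
UI_TO_INTERNAL = {
    "sr": ("hbs", "serbian"),      # shown as Serbian in the UI
    "hr": ("hbs", "croatian"),     # shown as Croatian in the UI
    "bs": ("hbs", "bosnian"),      # shown as Bosnian in the UI
    "en-NZ": ("eng", "newzealand"),
    "en-US": ("eng", "us"),
}

def to_training_label(ui_code: str) -> tuple:
    """Map a user-facing language code to (internal_language, accent)."""
    return UI_TO_INTERNAL[ui_code]

print(to_training_label("hr"))  # ('hbs', 'croatian')
```

Contributors only ever see the user-facing codes; the internal grouping stays an implementation detail of the training pipeline, which is exactly the separation argued for above.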
It is not a bad thing if you prefer to handle your language requests case by case and build your own list of languages along the way. I don't want to discourage you from it. Just be aware that you'd be doing linguistic work (and difficult work at that). If that's the path you choose, you probably want to involve linguists already at this stage, where you're trying to define what a language is.
I don't want to make you over-worried about it though. I'm pretty sure you can carry on with an intuitive and technically oriented definition of language. Many people will still be very enthusiastic to donate their voice regardless of how you define what a language is. They will be understanding and they will comply with your definition. But involving linguists now can save you from awkward situations later. Or at the very least, get you better prepared for those awkward situations.
Thanks for bringing this perspective; I understand what you mean. I'll have to check with the team about what is possible at a technical level; maybe there is a way to solve this for the languages where, working with linguists, we determine it's the best thing to do.
We will also have to check if there is a list of languages closer to our definition that we can rely on. As I said, this proposal was created based on the work of one of our linguistic experts.
Thanks again for your input!
I just want to say that in the case of the Basque language, choosing a region is OK. Choosing a city wouldn't work and for some speakers would be confusing. Accent regions and political regions don't match, so some people could choose the nearest city, or the city in their political region, and would make a wrong choice, because the Basque accent regions aren't distributed that way. If people with different accents choose the same city, data will be mixed. So, if city choosing is optional, Basque will keep using regions just as it already does (preferably without a city list, just to avoid user confusion).
Basque is mainly spoken in two countries, France and Spain. Two of the accents are used on the French side (in two different regions) and three accents on the Spanish side (in three different regions). So combining countries and regions would be possible too, but in the Basque language there aren't so many regions, so just choosing one region from a simple list is enough.
Hi, I'm a bit late to the party here, but I'd like to offer a linguist's point of view.
TL;DR: we should not crowdsource these definitions. Incorporate academic resources instead.
Example: Norwegian is a language group; the varieties are largely mutually intelligible with each other, and with Swedish. The two formalised standards are Bokmål (which, half-jokingly, is a koineized version written in Danish) and Nynorsk (an imaginary proto-version), both of which no-one "really speaks".
The situation is comparable with Finnish. The official version that has a standard is an invention made by amalgamating features from natural varieties; it's highly constructed (though it can be spoken).
Example: dialect continuum.
Consider:
ENGLISH: I am the son of my father and my mother.
SCOTS: A am the son o ma faither an ma mither.
FRISIAN: Ik bin de soan fan myn heit en myn mem.
DUTCH: Ik ben de zoon van mijn vader en mijn moeder.
Consider:
The Balkan example mentioned above.
What are the recommendations then?
For high-resource languages that have standards bodies, the metadata should designate whether the speaker is producing the standardised variety, e.g. a "native" English speaker, who can use either General American or Standard Southern British.
For regional varieties, the metadata should designate native speakers of a variety, as defined by widely established dialectology.
Non-native speech should be labelled as such. There are varying levels of "accentedness", from highly consistent L1 interference (in this case, you may say that the speaker has created a merged internal phonology in the process) to rampant lexical errors (e.g. using the wrong tone or quantity as a result of having no control over a phonemic contrast).
Now in terms of ASR, conventionally there are two models: the acoustic models and the language model. At some point it may be helpful to also have a separate phonology model: e.g. which phonemes can occur together, how they change into allophones in different contexts, or in the case of non-native phonology, substitutions etc.
In practical terms, what crowdsourced questions would be useful for describing the speech production itself? I imagine that, independent of the language/variety designation, we can get meaningful self-reported information along several axes:
Stable-Unstable
"When you speak this language, how stable is your accent over time?"
This goes from a native variety-speaker to, say, a cosmopolitan Finn who speaks convincing TV English but whose accent varies widely from week to week.
Cf. https://www.phonetik.uni-muenchen.de/~jmh/research/papers/harrington00.nature.pdf
Convincing-Accented
"How do others (especially native speakers) perceive your accent?"
This only applies when you are aiming for an idealised target, e.g. an actor in a film playing a speaker of some other variety. Note: here "accented" may be a bit misleading, since "convincing" is relative to the selected target; for example, when there is a consistent L1-mediated non-native phonology, the actor can put on a convincing "Russian accent" in English.
Regional-Koineized
"Are you speaking a variety that is used when regional locals talk to each other?"
When Glaswegians talk to Glaswegians, the production may be different from when they talk to New Zealanders.
Do you have any empirical evidence that people are not able to self-classify their accents, and that the subsequent classifications are not useful for the task of producing targeted speech recognition?
This is more or less the right approach.
Sorry everyone for taking so long to provide an update here.
We have been analyzing all your feedback, as well as consulting with more linguistic experts, both online and in person.
We are currently getting agreement on the final proposal, which I'll share here as soon as it's ready. It's currently leaning towards a less restrictive approach (as a lot of you asked for).
Thanks for your patience.
October update:
We are still waiting to have a few more conversations with linguists (sorry, this is taking way longer than we expected), and we have also been trying to balance the current proposal so we bring value to both product and linguistic researchers using our dataset.
A lot of this project has been learning as we go, and the complexity of providing value to everyone is higher than we initially thought.
My plan is to be able to come back with a recommendation by the end of this month.
Thanks all for your understanding!
Hi @txopi
It's the same with the Kabyle language.
When I met some members from the Garabide foundation, we talked about the different dialects and accents. They don't correspond to the administrative divisions. It's the same in Kabylia.