Help preserve dialects from vanishing by allowing a dialect flag to be added to spoken languages

Dear Common Voice team and community,

Thank you for initiating this amazing project with the open and privacy-friendly attitude of the Mozilla Foundation!
I think the use cases of this database could be pushed further by allowing dialect flags to be added to the recorded languages. By dialects I mean regional differences in pronunciation (and vocabulary) within a language, like the difference between French from northern France and French from southern France (both spoken by native but regional speakers). This is not the same as French spoken with a German accent by a non-native speaker.
I think this is important because the dialects of many languages have been dying out since globalization took off.

What do you think?

Hello @amo and welcome to the Common Voice community!

We are currently working on a completely new approach to our accents strategy.

The process is currently waiting for the legal team to review the proposal and give us the green light. We definitely want to capture how a language sounds different in different places :slight_smile:

Cheers.

I think those should just be treated as different languages? We already have (but lack contributions to) Breton for example.

For example, you refer to the north of France, so I guess you were thinking of Ch’ti: it actually has an ISO code, pcd, so it would be entirely possible for people who speak it to get together and start collecting under that new language.

Thanks for the hint, @nukeador!

@lissyx: Collecting varieties of a language as different languages is definitely the wrong approach, as it violates the linguistic notion of language varieties used to systematically describe things like dialects…
Breton is a bad example for this because it actually is a language of its own, spoken in a territory where French is the official language. The Picard dialect (a diatopic variety) of spoken French is still French, though.
Capturing Picard as a language (if there really still are notable numbers of native speakers…), just like Breton, would be wonderful, if possible at all. But I was thinking more of capturing the Picard dialect in spoken French, which is possible and should then be flagged as the Picard dialect of spoken French rather than as a new language.

I think we need to be clear about exactly what you have in mind here. My understanding is that your question is really about local languages that vary enough that they have an ISO code.

If we are just talking about a few local expressions used in specific areas, this is likely not going to be just about accent either. Speaking of accents, there’s always a need for them in the French dataset.

@ftyers could likely shed some light here.

This gets very complicated, because if you really start to have, for example, Picard sentences in the French dataset, people might also get stuck or be unable to act on them.

As long as it has an ISO code, I think nothing would stop you from doing it: localize Common Voice and provide a text corpus. Of course, it’s better if you are not alone.

I’m not from the north of France, so I might be missing the difference here between Picard as a recognized language and what you describe as the Picard dialect of spoken French.

My fear here is that this is a different yet similar need from the accent strategy. Free-form flagging is likely to end up a huge mess.

I wasn’t thinking of free-form flagging like putting hashtags on things. I am thinking of creating subcategories for languages, like the subcategory “Picard dialect” for the language “French”.

As for the distinction between accent and dialect: if you are a native German speaker and, after years of practice, you speak French but with the typical German pronunciation, then this is called an accent. In other words, someone from another country speaking your native language. If you are from southern France and a relative from Picardy visits you, then you both speak the same language natively, but your relative’s pronunciation and maybe a few special words will strike you, and this is what is called a dialect. In other words, someone with the same native language but from a different region, maybe like West Coast and East Coast American English.

So yeah, similar to the (soon previous) accent strategy, and we know how it ends up as of now.

So here we have an accent

And here we have local variations.

IMHO, if those local variations are known to most French speakers (“most” remains to be defined, since French is not limited to the Hexagon), I don’t think there is any point in flagging anything specific here.

Having those local variations in the dataset is still useful, though.

If those local variations are unknown to most non-Picard speakers (to stay with the example), then I think it just falls into the category of a new language.

Also, from your point of view, where do you draw the line for flagging something as “Picard dialect”? One word? Two words? A whole sentence?

@amo Part of the new accent strategy is to provide better descriptions. Currently, for France, we lack the level of detail you describe: someone from the north or the south is put in the same category as someone from the center with mostly “no accent”. They all fall under “Français de France”.

I need to emphasize that this is something Common Voice is aware of and wants to help with. And I think in your case, the best way is really to create a pcd effort.

Let me be devil’s advocate here. If you don’t have enough native speakers to be able to collect any data, then the sad truth is that your language is already nearly dead.

At the same time, preparing data collection and promoting Common Voice can also be a tool to help save it, by bringing efforts together, providing a central place to host the data, and giving the language visibility.

I don’t agree with your interpretations of the terms accent and dialect. I think it needs more linguistic expertise to find a good approach.
And to be honest, I am more than sure that people at linguistic departments in many universities worldwide would be happy to help. Collaboration efforts like that could fill a master’s thesis for a student in linguistics and the Common Voice project could benefit from real expertise in questions about how to grasp the varieties of languages.

My personal suggestion would therefore be to support what has been brought forward in the second post.

We already have some on board. More are obviously welcome.

Again, we are in touch with researchers in the field of endangered and under-resourced languages, and we have participated in several events around those topics.

Except that this is only meant to deal with accents, not what you refer to as dialects.

I don’t make any interpretation, I’m re-using your own words.
We have already had feedback from contributors who had a hard time recording or validating in French because some of the early dataset we used came from the French parliament, and thus the vocabulary was not everyday, street-level French.

I expect that mixing words from Picard, to continue with this example, with pure French, while it does reflect a certain way of speaking (and is thus a valuable addition to the dataset), is NOT the best way to collect this dialect and help preserve it.

That’s a tiny difference.

I don’t make any interpretation, I’m re-using your own words.

If you re-use my own words but with a different meaning, then we are writing about different things, aren’t we? You may call that meaning instead of interpretation, and that’s up to you.

And what is “pure French” supposed to be? There is no such thing, and this is my whole point. Actually, the mere assumption that there is some sort of pure, normal or standard language is nothing but an imperialist idea (standard French is the variety of French spoken by the French people with the bigger cannons). Check out the history of the French language.

Anyway, I think you are not qualified for this subject matter and I will not pursue this discussion.

Did I ever say « pure French »?

That is escalating quickly. You go from “mixing uncommon dialects into the data might make the dataset less efficient and not help your goal of preserving a dialect” to “imperialist who denies other languages”.

You are mixing multiple problems. All I’m saying is that if your goal is to preserve a dialect (and I again insist this is something we want to do with Common Voice), then I don’t think that “flagging” is the most efficient tool, and you should rather have a separate project dedicated to that dialect.

Anything else is pure literature. We are not here to establish a “pure French” but to collect the diversity of how French is spoken, and this is something we are trying to improve, as there is currently a huge bias regarding accents: we lack a lot of them.

@amo Thanks for your post and welcome to the Common Voice project.

I’m a Kabyle from Algeria, and I have been actively working on the Kabyle voice corpus, and from time to time on the French one, since the beginning. I have also been contributing to localization for years.

In Kabylia, we also speak French and we have our own accent, but we haven’t developed a local French language with our own words or our own grammar rules (morphology, syntax, phonetics/phonology, sentence structure…). We have our own French accent. For example, we mostly don’t pronounce “r” as it is pronounced in Île-de-France, and there are some other changes, mostly in phonology. But we use the same words and the same rules used by French-speaking people in France, Canada, Belgium, Switzerland, North Africa, … and I can understand people speaking French, even those from Mali, Niger, Senegal and Quebec too!!!

When there is no intercomprehension, there are different languages.

Once, I asked a linguist to explain to me the difference between a dialect and a language, and here is his answer: a dialect is a language that has no state or army. So, for experts, a spoken language that is politically defined as a dialect is really a language.

I prefer terms like “variante régionale” or “parler régional” when speaking about accents…

You can find all codified languages in the world on the Ethnologue website (for Picard:

While accents deal with variation in pronunciation and sounds, dialects really deal with languages that have their own rules. The rules can be inherited from a language family, a macrolanguage or a “sibling” language.

We’d like to see Picard on Common Voice, like Breton, Occitan, Corsican, Norman, …


Because Common Voice is the most open, free and largest voice database in the world. Common Voice deals with people’s real languages. It gives a chance to save minority and threatened languages around the world, and more!!!

I’d like to see Picard in GPS tools. I’d like to command my computer in Picard. I’d like to command all connected objects in Picard, and in Kabyle too. :smiley:

Behind Common Voice, there are other tech projects (DeepSpeech, TTS and others), so why not take advantage of them? These tech projects will use CV voice data to generate voice models for these languages. If we can get these voice models, we can use them in tools that handle voice: GPS, voice assistants on PCs, voice typing, voice translation, video subtitling…

I encourage you to add Picard, and we are here to help whenever you need it. You just have to gather people to localize and collect sentences in Picard.

I hope to see Picard on CV as soon as possible.

Thanks to everyone participating in this conversation.

Common Voice’s current strategy on languages and accents is being reworked after months of work with linguists and technical people: 🗣 Feedback needed: Languages and accents strategy (we expect this to be implemented in the middle or toward the end of this year).

Currently we are accepting new languages that are part of this standard Unicode list.

For French, we are capturing it as a dataset language and we plan to capture how different people speak French worldwide.

If your language is in the list I provided, it can be enabled following the usual procedure.

Please check both documents, and ask any questions if needed :slight_smile: