Adding the Occitan language: ideas and strategies


I am writing this message to get information about the process and to explain what I plan to do to add the Occitan language.
I attended a conference about Common Voice in Tolosa, France last November. I think I understood the different steps, but to be really sure I would like to list them and say what needs to be done.

1) Translating the website

I started some time ago, but it's a bit too much to do alone. I would like to teach people how to participate. I already wrote an article about this (here:
In January I'm organising a translation session with, hopefully, 4 other people: 2 proofreaders and 3 translators in total. At the moment there are 360 strings left to handle. We can make it :slight_smile:

2) Creating a corpus

A collection of at least 5,000 sentences is needed to complete this step.
Not to brag or anything, but Occitan has been written for more than 1,000 years and has been an administrative language, so the archives of more than 33 French departments are full of documents in it. Lots of books are no longer under copyright, if they ever were.
I saw in the GitHub repository that people contributed sayings and proverbs for their languages; I'm already gathering some of them.
I will contact the Occitan multimedia library (Lo Cirdòc) and another public organisation that has developed online dictionaries, a spellchecker, and voice recordings for Wikipedia (Lo Congrès).
In addition, I will try to contact book publishers and other associations to ask whether they can donate some sentences.
Since all these texts would be in a formal register, I have an idea. Tell me what you think about it.

I plan to build a website where people could submit 10 sentences of their own. I would like to give the address to people and ask them to write 10 spontaneous sentences, such as:
I went shopping for Christmas - I couldn't go by train because of the strike - etc. Everyday, useful sentences. I might also ask people to get their friends to do the same and, if those friends are not native Occitan speakers, offer to translate their sentences into Occitan.
I'm making a list of teachers, singers, writers, and friends to ask.

As for the last 2 steps, well, we have time before we get there.

Thanks for your feedback!


Thanks for sharing and contributing to Common Voice.

As a reminder, this is the place to find out everything about getting a language launched:


Hello you all!
The translation of the website is now complete :fireworks:
So the next step is creating a corpus from public sources, am I right?


That’s about right :slight_smile:

Welcome to Common Voice. Glad to see Occitan here.

Since you have just finished translating the website, you can join the Sentence Collector at this address:

  • First you need to create an account
  • Read the content on this page to understand what kind of sentences are expected (CC0 license) and a few rules
  • You can host your sentences on GitHub, for example, then paste the sentences into Sentence Collector and paste the link at the bottom as the source.

If you need anything, don't hesitate to ask me.

For information, I am involved in the corpus for Kabyle, a North African Berber language.


Hi Belkacem, I think our translations have already crossed paths on other projects :slight_smile:
I do have a question: how could we put the sentences into categories? I had seen a GitHub repository where the sentences were organised that way.

Indeed, we have crossed paths somewhere. Maybe on a machine translation project or on Academia?

As for your question, if it's about Common Voice, there is no categorisation. Before the Sentence Collector tool that feeds Common Voice was launched, we used GitHub to send files, and that is where I submitted files grouped by type of sentence. That's no longer the case. So, if you want to upload sentences, first create a project under a CC0 license on GitHub, where you can put your sentences in separate files. In Sentence Collector, you paste the content of those files into the interface along with the GitHub link. The sentences will then be shown on Common Voice in random order.
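
For what it's worth, the per-file workflow described above could be sketched roughly like this. It is only an illustration in Python, not something Common Voice or Sentence Collector provides: the function name `collect_sentences` is made up, and it assumes one sentence per line in UTF-8 `.txt` files, one file per category.

```python
from pathlib import Path


def collect_sentences(folder):
    """Return the unique, non-empty sentences found in the .txt files of `folder`.

    Assumption of this sketch: one sentence per line, UTF-8 encoded.
    Files are read in alphabetical order and the first occurrence of a
    sentence wins, so the output order is stable between runs.
    """
    seen = set()
    sentences = []
    for path in sorted(Path(folder).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            # Trim the line and collapse repeated whitespace.
            sentence = " ".join(line.split())
            if sentence and sentence not in seen:
                seen.add(sentence)
                sentences.append(sentence)
    return sentences
```

Calling `collect_sentences("sentences")` on a hypothetical `sentences` folder in the repository would give one de-duplicated list that can be joined with newlines and pasted into Sentence Collector, while the categories stay visible in the GitHub files themselves.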

But I can't wait to finally see Occitan on Common Voice. It will help me practise, starting with its phonetics/phonology. I'm a fan of the vernacular languages of France and Algeria :slight_smile:

Since this conversation is moving into French, I'm moving it to the French category to avoid noise in our main English category, merci!

