Thread: Discussion on how to manage repositories on GitHub for DeepSpeech work in other languages

Hi everyone, I am raising this thread for discussion around how to handle the GitHub repository (or repositories) for DeepSpeech work in other languages.

For example, @lissyx has done some excellent work in French (fr) using Common Voice data. This work has been the basis of other projects, such as the Kabyle effort.

The principles we are trying to satisfy here are:

  • Reduce re-work required to get DeepSpeech working on a new language (reduce developer time)
  • Reduce maintenance overhead
  • Reduce support needs on this channel
  • Reduce fragmentation of DeepSpeech work across GitHub

I can see the following options:

Option 1 - Fork the DeepSpeech work from commonvoice-fr to a new GitHub repo under the Mozilla organization

Example

https://github.com/mozilla/DeepSpeech-fr

How it would work

  • There would be a new repo for work in each language - i.e. DeepSpeech-kab for Kabyle, DeepSpeech-it for Italian, DeepSpeech-mri for Te Reo Māori, and so on. Each repo could be forked for work on a new language. For example, Te Reo Māori is close to Hawaiian and other Pasifika languages, and DeepSpeech-mri would be a natural starting point for DeepSpeech work in those languages.

Pros

  • Work on different languages is easily separated, and relevant NLP tools can be added for that language.
  • Each language community is likely to be reasonably small, so merging commits is likely to be easier.

Cons

  • Much harder to keep language-specific work up to date. Each time DeepSpeech is updated to a new version, every language-specific repo goes out of date, and keeping them all current requires a lot of maintenance.
  • Language-specific repos are likely to fall behind the current version of DeepSpeech, which will in turn incur a higher support load in this Discourse channel.

Option 2 - Push the Docker work from commonvoice-fr into the PlayBook

Example

The Dockerfile.train file in the commonvoice-fr repo has been heavily customized from the Dockerfile.train that ships with DeepSpeech, and uses a range of bash scripts for the French version. The customized file would likely be provided on the environment page of the PlayBook, which has a section on customizing Docker files.

Pros

  • Many of the people using the PlayBook will want to train on languages other than English, which is the default that DeepSpeech is set up for. Providing the customized Dockerfile.train will reduce their setup time.

Cons

  • The PlayBook uses the Dockerfile.train that ships with DeepSpeech. If the PlayBook instructs users to use the Dockerfile work from commonvoice-fr, it will add complexity to the PlayBook.
  • As people customize their DeepSpeech Dockerfiles, there isn’t really a way for that work to be shared back with the community. That is, the work is likely to end up fragmented across many repos.

Option 3 - Create a directory structure within the DeepSpeech repo that provides for multiple languages

Example

Within the GitHub repo for DeepSpeech, create directory structures that better accommodate different languages. For example, rename alphabet.txt to alphabet-EN.txt and so on. The Dockerfile would be changed to take a language as a parameter.
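As a rough illustration, a thin wrapper could resolve per-language files from a single parameter. This is a sketch, not existing DeepSpeech tooling: the alphabet-XX.txt naming follows the example above, the wrapper itself is hypothetical, and only the `--alphabet_config_path` training flag comes from DeepSpeech.

```python
#!/usr/bin/env python3
"""Hypothetical launcher: pick per-language files from one --language flag."""
import argparse
import pathlib
import subprocess

parser = argparse.ArgumentParser(description="Train DeepSpeech for one language")
parser.add_argument("--language", default="en", help="language code, e.g. en, fr, kab")
args = parser.parse_args()

# Resolve the per-language alphabet, e.g. data/alphabet-FR.txt for --language fr.
alphabet = pathlib.Path("data") / f"alphabet-{args.language.upper()}.txt"
if not alphabet.exists():
    raise SystemExit(f"No alphabet for '{args.language}' - add {alphabet} to support it")

# Delegate to the regular training entry point with the resolved file.
subprocess.run(
    ["python", "DeepSpeech.py", "--alphabet_config_path", str(alphabet)],
    check=True,
)
```

Inside Docker, the same idea could be a build argument (e.g. `docker build --build-arg LANGUAGE=fr ...`) that selects which files are copied into the image.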

Pros

  • Reduces maintenance overhead; if DeepSpeech is updated then all languages are updated at the same time.
  • Makes it a lot easier to start training in another language, which in turn reduces barriers for mobilizing language communities.
  • Makes it a lot easier to encourage contributions to DeepSpeech from language communities.

Cons

  • This will require significant effort to achieve.
  • May make the DeepSpeech package larger, which is something we want to avoid.
  • Significant effort would be required to merge in the various language work that has been done around DeepSpeech.

Discussion and comments warmly welcomed.

Hi @kreid,
In general, I think this is a great idea.

I just wanted to let you know that I have already done some work in this area with my DeepSpeech-Polyglot project:

Pros:

  • Currently supports 5 different languages (German, Spanish, French, Italian, Polish)
  • Covers the whole training process, with data preprocessing, language model building, training and exporting.
  • Adding support for new languages is also very easy: you just have to add a new alphabet_xx.txt file and extend the special-words and character-replacement file (langdicts.json) - see the sketch after this post.

Cons:

  • I will soon drop support for direct integration into DeepSpeech, because I’m trying to replace it with an improved network architecture (I’m not finished with it yet).
    I’m open to a full integration of the exported networks into DeepSpeech again, but this requires some effort, mainly in the native client code, and currently I don’t have the time for it (I already started a discussion about this here: Integration of DeepSpeech-Polyglot's new networks).
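As a sketch of what the replacement step above might look like: the file names alphabet_xx.txt and langdicts.json come from the post, but the JSON schema and the normalize() helper below are assumptions for illustration, not DeepSpeech-Polyglot's actual code.

```python
import json

# Assumed langdicts.json shape (illustrative only):
# {"de": {"replacements": {"ß": "ss", "ä": "ae"},
#         "special_words": {"%": "prozent", "€": "euro"}}}

def normalize(text: str, lang: str, dicts_path: str = "langdicts.json") -> str:
    """Apply per-language special-word and character replacements."""
    with open(dicts_path, encoding="utf-8") as f:
        entry = json.load(f)[lang]
    for token, spoken in entry["special_words"].items():
        text = text.replace(token, f" {spoken} ")   # "10%" -> "10 prozent "
    for char, repl in entry["replacements"].items():
        text = text.replace(char, repl)             # "Straße" -> "Strasse"
    return " ".join(text.split())                   # collapse extra whitespace

# e.g. normalize("Straße 10%", "de") -> "Strasse 10 prozent"
```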

I’m in favour of Option #1; if we start out thinking of 1000-language support, then Option #3 is not really viable. Better to come up with a method of automatically updating when a new DS version comes out. In other projects this is done by having a core module and then language-specific modules; the core module would include code that provides an interface to the language-specific modules. That way, updates could be made without breaking the end API.
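Purely as an illustration of that core-plus-language-modules split (none of these names exist in DeepSpeech; this is a sketch of the pattern, not a proposal for its API):

```python
"""Core module: exposes one stable lookup; languages plug themselves in."""
from dataclasses import dataclass
from typing import Callable, Dict

def _identity(text: str) -> str:
    return text

@dataclass
class LanguagePack:
    """Everything the core needs to know about one language."""
    code: str                                     # e.g. "fr" or "kab"
    alphabet: str                                 # path to the alphabet file
    normalize: Callable[[str], str] = _identity   # per-language text cleanup hook

_REGISTRY: Dict[str, LanguagePack] = {}

def register(pack: LanguagePack) -> None:
    """Each language module calls this once at import time."""
    _REGISTRY[pack.code] = pack

def get(code: str) -> LanguagePack:
    """The stable end API: adding a language never changes this signature."""
    return _REGISTRY[code]

# A language module is then just a registration call:
register(LanguagePack(code="fr", alphabet="languages/fr/alphabet.txt"))
register(LanguagePack(code="kab", alphabet="languages/kab/alphabet.txt"))
```

A new DeepSpeech release would then only need to keep get() stable, and language modules could update on their own schedule.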


What about a combination of Option #1 and Option #3: create one extra repository for all the language-specific material?
This would prevent fragmentation into 1000 different repos for 1000 languages, but I don’t think it would grow too big either, because for most languages you only need to add a new alphabet file, plus some README steps with dataset-specific instructions.


Thanks, this is what I’ve been advocating for.

We are really focusing on the training pipeline work I conducted here.


What are the next steps here?

I have been collecting alphabets, validation scripts, and so on for all the Common Voice languages here.
