Hi everyone, I am raising this thread for discussion around how to handle the GitHub repository(ies) for work in DeepSpeech in other languages.
For example, @lissyx has done some excellent work in fr using CommonVoice data. This work has been the basis of other projects such as Kabyle.
The principles we are trying to satisfy here are:
- Reduce re-work required to get DeepSpeech working on a new language (reduce developer time)
- Reduce maintenance overhead
- Reduce support needs on this channel
- Reduce fragmentation of DeepSpeech work across GitHub
I can see the following options:
Option 1 - Fork the DeepSpeech work from commonvoice-fr to a new GitHub repo under the Mozilla organization
Example
https://github.com/mozilla/DeepSpeech-fr
How it would work
- There would be a new repo for work in each language, e.g. DeepSpeech-kab for Kabyle, DeepSpeech-it for Italian, DeepSpeech-mri for Te Reo Māori and so on. Each repo could be forked for work on a new language. For example, Te Reo Māori is close to Hawai’ian and other Pasifika languages, and DeepSpeech-mri would be a natural starting point for DeepSpeech work in those languages.
Pros
- Work on different languages is easily separated, and relevant NLP tools can be added for that language.
- Each language community is likely to be reasonably small, so merging commits is likely to be easier.
Cons
- Much harder to keep specific language work up to date. Each time DeepSpeech is updated to a new version, each language-specific repo goes out of date. Requires a lot of maintenance to keep updated.
- It is likely that language-specific repos will fall behind the current version of DeepSpeech, and then that will incur a higher support load in this Discourse channel.
Option 2 - Push the Docker work from commonvoice-fr into the PlayBook
Example
The Dockerfile.train file in the commonvoice-fr repo has been heavily customized from the Dockerfile.train that ships with DeepSpeech, and uses a range of bash scripts for the French version. It would likely be provided in the environment page of the PlayBook, which has a section on customizing Docker files.
Pros
- Many of the people using the PlayBook will want to train on languages other than English, which is the default that DeepSpeech is set up for. Providing the customized Dockerfile.train will reduce their setup time.
Cons
- The PlayBook uses the Dockerfile.train that ships with DeepSpeech. If the PlayBook instructs users to use the Dockerfile work from commonvoice-fr, it will add additional complexity to the PlayBook.
- As people customize their DeepSpeech Dockerfiles, there isn’t really a way for the work that is done to be shared back with the community. That is, the work is likely to end up fragmented across many repos.
Option 3 - Create a directory structure within the DeepSpeech repo that provides for multiple languages
Example
Within the GitHub repo for DeepSpeech, create directory structures that better accommodate different languages. For example, rename alphabet.txt to alphabet-EN.txt and so on. The Dockerfile would be changed to take a language as a parameter.
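To make this a bit more concrete, here is a rough sketch of how a training script might pick up language-specific files once the Dockerfile passes a language code through. This is only an illustration under assumptions: the function names and the `data_dir` argument are hypothetical and are not existing DeepSpeech code; only the alphabet-EN.txt naming comes from the example above.

```python
from pathlib import Path


def alphabet_path(data_dir, lang):
    """Return the alphabet file path for a language code such as 'fr'.

    Hypothetical helper: assumes files are named alphabet-<CODE>.txt,
    following the alphabet-EN.txt example in this post.
    """
    code = lang.strip().upper()
    if not code.isalpha():
        raise ValueError(f"invalid language code: {lang!r}")
    return Path(data_dir) / f"alphabet-{code}.txt"


def available_languages(data_dir):
    """List the language codes that have an alphabet file in data_dir."""
    files = Path(data_dir).glob("alphabet-*.txt")
    # "alphabet-EN" -> "EN"
    return sorted(p.stem.split("-", 1)[1] for p in files)
```

With a layout like this, adding a language would mostly mean adding one more alphabet-XX.txt file (plus whatever language-specific data is needed), rather than forking the whole repo.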
Pros
- Reduces maintenance overhead; if DeepSpeech is updated then all languages are updated at the same time.
- Makes it a lot easier to start training in another language, which in turn reduces barriers for mobilizing language communities.
- Makes it a lot easier to encourage contributions to DeepSpeech from language communities.
Cons
- This will require significant effort to achieve
- May make the DeepSpeech package larger, which is something we want to avoid
- Significant effort would be required to merge in the various language work that has been done around DeepSpeech.
Discussion and comments warmly welcomed.