Hi everyone, I am raising this thread for discussion around how to handle the GitHub repository(ies) for DeepSpeech work in other languages.
The principles we are trying to satisfy here are:
- Reduce re-work required to get DeepSpeech working on a new language (reduce developer time)
- Reduce maintenance overhead
- Reduce support needs on this channel
- Reduce fragmentation of DeepSpeech work across GitHub
I can see the following options:
Option 1 - Fork the DeepSpeech work from commonvoice-fr to a new GitHub repo under the Mozilla organization
How it would work
- There would be a new repo for work in each language, e.g. DeepSpeech-mri for Te Reo Māori and so on. Each repo could be forked for work on a new language. For example, Te Reo Māori is close to Hawai’ian and other Pasifika languages, and DeepSpeech-mri would be a natural starting point for DeepSpeech work in those languages.
- Work on different languages is easily separated, and relevant NLP tools can be added for that language.
- Each language community is likely to be reasonably small, so merging commits is likely to be easier.
- Much harder to keep specific language work up to date. Each time DeepSpeech is updated to a new version, each language-specific repo goes out of date. Requires a lot of maintenance to keep updated.
- It is likely that language-specific repos will fall behind the current version of DeepSpeech, and then that will incur a higher support load in this Discourse channel.
Option 2 - Push the Docker work from commonvoice-fr into the PlayBook
The Dockerfile.train file in commonvoice-fr has been heavily customized from the Dockerfile.train that ships with DeepSpeech, and uses a range of bash scripts for the French version. It would likely be provided in the environment page of the PlayBook, which has a section on customizing Docker files.
- Many of the people using the PlayBook will want to train on languages other than English, which is the default that DeepSpeech is set up for. Providing the customized Dockerfile.train will reduce their setup time.
- The PlayBook uses the Dockerfile.train that ships with DeepSpeech. If the PlayBook instructs users to use the commonvoice-fr version instead, it will add complexity to the PlayBook.
- As people customize their Dockerfiles, there isn’t really a way for the work that is done to be shared back with the community. That is, the work is likely to end up fragmented across many repos.
Option 3 - Create a directory structure within the DeepSpeech repo that provides for multiple languages
Within the GitHub repo for DeepSpeech, create directory structures that better accommodate different languages. For example, rename language-specific files to carry a language code, such as alphabet-EN.txt, and so on. The Dockerfile would be changed to take a language as a parameter.
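To make this a bit more concrete, here is a rough Python sketch of how a language parameter could map onto per-language files. Everything in it is an assumption for illustration only: the data/<lang>/ directory layout, the function name language_paths, and the rule that a language code is two or three letters are not taken from the DeepSpeech repo. Only the alphabet-EN.txt naming pattern comes from the proposal above.

```python
from pathlib import Path


def language_paths(lang: str, root: Path = Path("data")) -> dict:
    """Resolve per-language file locations from a language code.

    Hypothetical layout (an assumption, not the real DeepSpeech tree):
        data/<lang>/alphabet-<LANG>.txt
    """
    code = lang.strip().lower()
    # Assumed rule: accept two- or three-letter codes such as "fr" or "mri".
    if not code.isalpha() or not 2 <= len(code) <= 3:
        raise ValueError(f"unexpected language code: {lang!r}")

    lang_dir = root / code
    return {
        "lang_dir": lang_dir,
        # Follows the alphabet-EN.txt naming pattern from the proposal.
        "alphabet": lang_dir / f"alphabet-{code.upper()}.txt",
    }


if __name__ == "__main__":
    for code in ("en", "fr", "mri"):
        print(code, language_paths(code)["alphabet"].as_posix())
```

A training script or Dockerfile entrypoint could call something like this once with the chosen language, so that adding a new language means adding a directory rather than forking the repo.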
- Reduces maintenance overhead; if DeepSpeech is updated then all languages are updated at the same time.
- Makes it a lot easier to start training in another language, which in turn reduces barriers for mobilizing language communities.
- Makes it a lot easier to encourage contributions to DeepSpeech from language communities.
- This will require significant effort to achieve
- May make the DeepSpeech package larger, which is something we want to avoid.
- Significant effort would be required to merge in the various language work that has been done around DeepSpeech.
Discussion and comments warmly welcomed.