What should be in the DeepSpeech Playbook?

Firstly, a big hello! Or kia ora! Or habari ya asubuhi! Or selamat siang :slight_smile:

I’m Kathy Reid, a part-timer with Mozilla and an open source voice specialist. I’m working with Mozilla Fellow @Joshua_Meyer to put together a Playbook for DeepSpeech, and we would warmly welcome your input and feedback.

Training a speech-to-text model with DeepSpeech has a steep learning curve. Following the example of the Common Voice Playbook, we’d like to put together a DeepSpeech Playbook. This will serve partly as a quick-start guide, partly as a set of tutorials, and partly as an on-ramp that allows folx new to DeepSpeech to begin training speech models.

We’re anticipating that the Playbook will have the following broad sections:

  • Fundamentals of Speech-to-Text
  • Data collection
  • Data formatting
  • Model training
  • Model fine-tuning
  • Model testing (quantitative)
  • Model evaluation (qualitative)

As people who use DeepSpeech every day - and wrangle some of its quirks and hurdles - you know best what content the Playbook needs to include. We’d love to hear about the specific challenges you face using DeepSpeech, particularly those you encountered while training your first models.

What are the one or two things that you had to learn the hard way and wished there was a walkthrough for? Did you take notes that we might be able to use?

Please do let us know in the comments below. Be sure to specify:

  • the problem or hurdle you encountered
  • how you overcame it (or didn’t)
  • what information or guidance would have helped you overcome it
  • a pointer to an example or additional documentation, if available

We will share the Playbook openly once it is more developed, and anticipate it being under the same license as the Common Voice Playbook (CC-BY-SA 3.0).

A huge thank you in advance for helping us to help the DeepSpeech community, and may you and your loved ones remain safe and well as we progress through the pandemic.

Kind regards,
Kathy


Great to have you on board, @kreid. From my point of view, we have many people struggling with the basic concepts of deep learning, as well as people missing a detailed walkthrough of how to prepare data and train models for the two main use cases: general ASR and limited vocabulary (like keywords).


Thanks so much @othiele, really appreciate all your contributions to this channel and your excellent advice on where to start.

This is a great initiative, @kreid!

I imagine some of what I mention below may already be among the things you have in mind.

Picking up on what @othiele says, I feel it is vital to start with a simplified, high-level description of how the parts of DeepSpeech fit together, and the basic routes through them for different desired outcomes. Preferably with a few carefully crafted diagrams!

It also seems worthwhile to front-load the Playbook with guidance on what is and isn’t feasible, what the key skills and requirements are, and how much transcribed audio is realistic. That would give the over-optimistic an idea of how practical DeepSpeech is for their goals.

The first big problem I had was with CUDA version requirements, which conflicted with the CUDA requirements of other projects I was working on. I struggled a bit initially, but eventually found it really useful to use Conda to switch into distinct environments, each with the required CUDA version installed as part of the environment, so everything is nicely isolated. Regular virtual environments are good for isolating required packages, but the Conda method also helped with CUDA (there may well be other ways to achieve the same thing, but this proved robust and repeatable, for me at least). I know Conda was generally advised against (as it wasn’t supported), but I was happy to accept the risk of figuring out certain issues myself in return for the overall benefit. The rest of the installation was then handled via pip, i.e. in line with the DeepSpeech instructions.
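
For anyone who wants to try the same approach, the setup was roughly of this shape (the version numbers are illustrative only, so check which CUDA/cuDNN combination the TensorFlow behind your DeepSpeech release actually needs):

```bash
# Create an isolated Conda environment carrying its own CUDA toolkit, so it
# can't clash with the CUDA versions other projects on the machine rely on.
# Versions shown here are placeholders, not recommendations.
conda create -n deepspeech-train python=3.6
conda activate deepspeech-train
conda install cudatoolkit=10.0 cudnn

# The rest of the installation via pip, in line with the DeepSpeech
# instructions (run from inside a checkout of the DeepSpeech repo).
pip install --upgrade pip wheel setuptools
pip install --upgrade -e .
```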

A second early issue I had was with my expectations! I tried fine-tuning and ran it for eight days, which seemed like ages to me back then, only to find that the results were quite a bit worse in general use! I’d been using British English audio I’d recorded myself, in the hope of tailoring the model to my own voice. In retrospect I had too little audio (I forget the exact numbers, but I think it was roughly six hours of fairly diverse sentences). The solution was to appreciate the learning curve, learn some perseverance, and dig into the details, the code, and the forum entries until I understood things better. I also found that effort on language models was a great area to focus on for my aims (whilst I bolstered the amount of audio I recorded for later fine-tuning efforts).
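
For context, the fine-tuning run itself was roughly of this shape; the paths and values here are from memory and purely illustrative, so check the flags against the documentation for your DeepSpeech release:

```bash
# Fine-tune from a released checkpoint with a lowered learning rate, so the
# new audio nudges the pre-trained weights rather than overwriting them.
python3 DeepSpeech.py \
  --checkpoint_dir path/to/released-checkpoint \
  --train_files my_voice/train.csv \
  --dev_files my_voice/dev.csv \
  --test_files my_voice/test.csv \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --epochs 3
```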

Not sure if you want the Playbook to be stand-alone, but it would be worth including guidance on how to interpret errors, and on how best to seek help and ask effective questions if you can’t figure out a problem from the error description, the Playbook, and a little digging into the code.


Thanks for this great initiative @kreid and @Joshua_Meyer!

In general, one thing people seem to have trouble with is creating and maintaining clean environments. We try to document the use of tools such as virtualenv, but again and again people show up with problems caused by mixed versions, packages from alternative sources, and so on.
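
For reference, the clean-environment workflow the docs aim for is roughly the following; the paths are just an example:

```bash
# One fresh virtual environment per DeepSpeech checkout, so package versions
# never mix across projects or releases.
python3 -m venv ~/venvs/deepspeech-training
source ~/venvs/deepspeech-training/bin/activate

# Install everything with pip from inside the environment, and avoid mixing
# in packages from alternative sources.
pip install --upgrade pip wheel setuptools
pip install --upgrade -e .   # from a checkout of the matching release tag
```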

I think it’s important to frame these processes as an iterative loop, not something you do sequentially. The best way to identify problems in your data collection and formatting early is by using the data. Beyond technical problems such as incorrect formatting, this can also reveal problems in the collection design early, such as failing to cover a use case, dialect, or accent you’re interested in.

An important component of model training that we don’t cover in our documentation is hyperparameter optimization. It is covered in materials that teach the basic concepts of deep learning, so referencing those, as @othiele suggested, could be enough; but in general we see lots of reports of the type “I ran the tooling with the default parameters and it didn’t work, is it broken?”.
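
Even a crude sweep helps set expectations here. A minimal sketch, assuming the standard DeepSpeech.py flags (the values are placeholders, not recommendations):

```bash
# Train the same data with several learning rates and compare the resulting
# dev/test losses, rather than trusting the defaults to suit your dataset.
for lr in 0.001 0.0001 0.00001; do
  python3 DeepSpeech.py \
    --train_files data/train.csv \
    --dev_files data/dev.csv \
    --test_files data/test.csv \
    --learning_rate "$lr" \
    --epochs 10 \
    --checkpoint_dir "checkpoints/lr_${lr}"
done
```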


One other idea, which could be part of the Playbook or live separately, is to put together a glossary for the project and related concepts, perhaps giving definitions with a bit of DeepSpeech-specific context.

When people are starting out, they often don’t know the right words for the things they’re trying to do, or don’t recognise a term when it’s used (especially if it’s something a more experienced person takes for granted).

For instance, here’s a case where someone didn’t know the term “code-switching” and thought it applied to “code” (a perfectly reasonable mix-up for someone without a linguistics background).


Thanks so much @nmstoker, this feedback is incredibly useful. Really appreciate the pointers on virtual envs - goodness knows I’ve run into those issues a bunch of times too :slight_smile:

You also raise an excellent point about expectations, training, and how much voice data is required for accurate speech recognition. Super helpful!


A glossary is a brilliant idea!
