In the 0.5.x binaries, it took approximately 8s to load the model. Once loaded, the inference time was 1.5 seconds to transcribe 7s of data (on our GPU).
With the 0.7.1 binaries, it’s taking approximately 82 seconds to load the model. Inference time is about the same.
See below for the logs using DeepSpeech 0.7.1 binaries/model (built locally, trained on our data). I don’t have the full logs for 0.5.x binaries (but I can provide more specific data to back this up if needed).
Or it could be something specific to our environment, so I'm just asking whether others are already aware of this difference?
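For reference, the numbers above come from timing the model load and the stt() call separately, which is essentially what the stock client does. Here's a minimal sketch of that measurement, assuming the deepspeech(-gpu) 0.7.1 Python package; the model/scorer paths are the ones from the logs below, and the WAV path is a placeholder for a 16 kHz, 16-bit mono file:

```python
import time
import wave

import numpy as np
from deepspeech import Model

MODEL_PATH = "model/output_graph.pbmm"
SCORER_PATH = "model/lm.scorer"
AUDIO_PATH = "sample.wav"  # placeholder: 16 kHz, 16-bit, mono

start = time.perf_counter()
ds = Model(MODEL_PATH)  # this is the step that went from ~8 s to ~80 s for us
print("Loaded model in {:.3f}s.".format(time.perf_counter() - start))

ds.enableExternalScorer(SCORER_PATH)

with wave.open(AUDIO_PATH, "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    duration = wav.getnframes() / wav.getframerate()

start = time.perf_counter()
text = ds.stt(audio)
print("Inference took {:.3f}s for {:.3f}s audio file.".format(
    time.perf_counter() - start, duration))
print(text)
```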
Loading model from file model/output_graph.pbmm
Loaded model in 83.8s.
Loading scorer from files model/lm.scorer
Loaded scorer in 0.000285s.
Running inference.
Inference took 1.315s for 6.997s audio file.
ka arohia katoatia te hāhi me ōna ƒakapono e te hapū o ōtākou
Update on this, FTR: I recompiled the DeepSpeech binaries for the K80 (compute capability 3.7), and now the model loads much faster again.
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10691 MB memory) -> physical GPU (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
Loading model from file model/output_graph.pbmm
Loaded model in 1.87s.
Loading scorer from files model/lm.scorer
Loaded scorer in 0.000244s.
Running inference.
Inference took 1.566s for 6.997s audio file.
ka arohia katoatia te hāhi me ōna ƒakapono e te hapū o ōtākou
For clarity:
the first example above was running on a V100 (p3.2xlarge instance) with a DeepSpeech 0.7.1-ish binary compiled with ENV TF_CUDA_COMPUTE_CAPABILITIES 6;
the second example above was running on a K80 (p2.8xlarge instance) with the exact same code, but with the binary compiled with ENV TF_CUDA_COMPUTE_CAPABILITIES 3.7.
(FWIW, when I tried running the first binary on a K80 it ignored the GPU: Ignoring visible gpu device (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 6.0.)
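In case it helps anyone else hitting this: a quick way to catch the mismatch up front is to ask the CUDA driver for each card's compute capability and compare it against what the binary was built with. A rough sketch (just an illustration, not part of our pipeline), assuming libcuda is loadable via ctypes:

```python
import ctypes

# Query each visible GPU's compute capability straight from the CUDA driver,
# independently of what the DeepSpeech/TensorFlow binary was built for.
# 75/76 are CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR/MINOR.
cuda = ctypes.CDLL("libcuda.so.1")  # driver library; the name may differ on your system
assert cuda.cuInit(0) == 0

count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0

for i in range(count.value):
    dev = ctypes.c_int()
    cuda.cuDeviceGet(ctypes.byref(dev), i)

    name = ctypes.create_string_buffer(128)
    cuda.cuDeviceGetName(name, len(name), dev)

    major, minor = ctypes.c_int(), ctypes.c_int()
    cuda.cuDeviceGetAttribute(ctypes.byref(major), 75, dev)
    cuda.cuDeviceGetAttribute(ctypes.byref(minor), 76, dev)

    print("GPU {}: {}, compute capability {}.{}".format(
        i, name.value.decode(), major.value, minor.value))
```

On a K80 this reports compute capability 3.7, so a binary built with TF_CUDA_COMPUTE_CAPABILITIES 6 will ignore it, as in the log line above.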
Great insight, thanks for letting us know. I hadn't made that observation before, but we only load the model once at startup, so maybe I just didn't notice the longer loading time. I'll have to check.
lissyx:
I guess it's just a side effect of us limiting the set of compute capabilities we prebuild for. We had to find a compromise, because enabling too many compute capability versions made the NodeJS GPU package so big that it would just crash the node runtime at package upload…