In the 0.5.x binaries, it took approximately 8s to load the model. Once loaded, the inference time was 1.5 seconds to transcribe 7s of data (on our GPU).
With the 0.7.1 binaries, it’s taking approximately 82 seconds to load the model. Inference time is about the same.
See below for the logs using DeepSpeech 0.7.1 binaries/model (built locally, trained on our data). I don’t have the full logs for 0.5.x binaries (but I can provide more specific data to back this up if needed).
Or it could be something specific to our environment, so I'm just asking whether others are already aware of this difference?
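For reference, the numbers above come from timing the model load and the stt() call separately, which is essentially what the stock client does. Here's a minimal sketch of that measurement, assuming the deepspeech(-gpu) 0.7.1 Python package; the model/scorer paths are the ones from the logs below, and the WAV path is a placeholder for a 16 kHz, 16-bit mono file:

```python
import time
import wave

import numpy as np
from deepspeech import Model

MODEL_PATH = "model/output_graph.pbmm"
SCORER_PATH = "model/lm.scorer"
AUDIO_PATH = "sample.wav"  # placeholder: 16 kHz, 16-bit, mono

start = time.perf_counter()
ds = Model(MODEL_PATH)  # this is the step that went from ~8 s to ~80 s for us
print("Loaded model in {:.3f}s.".format(time.perf_counter() - start))

ds.enableExternalScorer(SCORER_PATH)

with wave.open(AUDIO_PATH, "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    duration = wav.getnframes() / wav.getframerate()

start = time.perf_counter()
text = ds.stt(audio)
print("Inference took {:.3f}s for {:.3f}s audio file.".format(
    time.perf_counter() - start, duration))
print(text)
```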
Loading model from file model/output_graph.pbmm
Loaded model in 83.8s.
Loading scorer from files model/lm.scorer
Loaded scorer in 0.000285s.
Running inference.
Inference took 1.315s for 6.997s audio file.
ka arohia katoatia te hāhi me ōna ƒakapono e te hapū o ōtākou
Update on this, FTR: I recompiled the DeepSpeech binaries for the K80 (compute capability 3.7), and now the model loads much faster again.
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10691 MB memory) -> physical GPU (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
Loading model from file model/output_graph.pbmm
Loaded model in 1.87s.
Loading scorer from files model/lm.scorer
Loaded scorer in 0.000244s.
Running inference.
Inference took 1.566s for 6.997s audio file.
ka arohia katoatia te hāhi me ōna ƒakapono e te hapū o ōtākou
For clarity:
the first example above was running on a V100 (p3.2xlarge instance) with a DeepSpeech 0.7.1-ish binary compiled with ENV TF_CUDA_COMPUTE_CAPABILITIES 6;
the second example above was running on a K80 (p2.8xlarge instance) with the exact same code, but with the binary compiled with ENV TF_CUDA_COMPUTE_CAPABILITIES 3.7.
(FWIW, when I tried running the first binary on a K80 it ignored the GPU: Ignoring visible gpu device (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 6.0.)
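In case it helps anyone else hitting this: a quick way to catch the mismatch up front is to ask the CUDA driver for each card's compute capability and compare it against what the binary was built with. A rough sketch (just an illustration, not part of our pipeline), assuming libcuda is loadable via ctypes:

```python
import ctypes

# Query each visible GPU's compute capability straight from the CUDA driver,
# independently of what the DeepSpeech/TensorFlow binary was built for.
# 75/76 are CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR/MINOR.
cuda = ctypes.CDLL("libcuda.so.1")  # driver library; the name may differ on your system
assert cuda.cuInit(0) == 0

count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0

for i in range(count.value):
    dev = ctypes.c_int()
    cuda.cuDeviceGet(ctypes.byref(dev), i)

    name = ctypes.create_string_buffer(128)
    cuda.cuDeviceGetName(name, len(name), dev)

    major, minor = ctypes.c_int(), ctypes.c_int()
    cuda.cuDeviceGetAttribute(ctypes.byref(major), 75, dev)
    cuda.cuDeviceGetAttribute(ctypes.byref(minor), 76, dev)

    print("GPU {}: {}, compute capability {}.{}".format(
        i, name.value.decode(), major.value, minor.value))
```

On a K80 this reports compute capability 3.7, so a binary built with TF_CUDA_COMPUTE_CAPABILITIES 6 will ignore it, as in the log line above.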
Great insight, thanks for letting us know. I hadn't made that observation before, but we only load the model once at startup, so maybe I just didn't notice the longer loading time. I'll have to check.
lissyx:
I guess it's just a side effect of us limiting the set of compute capabilities we prebuild for. We had to find a compromise, because enabling too many compute capability versions made the NodeJS GPU package so big that it would just crash the node runtime at package upload…