DeepSpeech model loading time - why is it so fast?

Something that came as a pleasant surprise to me when using DeepSpeech is that the model loading time is extremely fast (instantiating the DeepSpeech Model from a memory-mapped file).

Typically, I see loading times of 20-30 milliseconds or less on a fairly standard cloud machine with non-SSD persistent disks.
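For reference, this is roughly how I measured it (a minimal sketch; the model path is a placeholder, and it assumes a DeepSpeech release whose Python `Model` constructor takes just the model file, which older versions did not):

```python
import time

from deepspeech import Model

MODEL_PATH = "output_graph.pbmm"  # placeholder path to the mmap-friendly model file

start = time.perf_counter()
model = Model(MODEL_PATH)  # instantiate from the memory-mapped graph
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Model load took {elapsed_ms:.1f} ms")
```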

I’m not a computer scientist and have no real understanding of memory-mapped files, but I’m wondering what allows the model to load so quickly when the model file itself is ~190 MB. Surely the disk read speed is much lower than what the load time would suggest (it would need to be in the GB/s range). I’m also not seeing any lag on the initial streaming recognition, which is the lag I would expect if some sort of page-fault lazy loading were happening.
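For comparison, the kind of back-of-the-envelope check I mean is just timing a plain sequential read of a file that size (a rough sketch; the path is a placeholder, and the page cache needs to be cold for the comparison to be fair):

```python
import time

FILE_PATH = "output_graph.pbmm"  # placeholder: any ~190 MB file on the same disk

# Plain sequential read of the whole file. On a non-SSD persistent disk
# (roughly 100-200 MB/s) this should take on the order of a second, nowhere
# near the 20-30 ms I see for model instantiation. The OS page cache must be
# cold (e.g. first run after boot) for the number to be meaningful.
start = time.perf_counter()
with open(FILE_PATH, "rb") as f:
    data = f.read()
elapsed = time.perf_counter() - start
print(f"read {len(data) / 1e6:.0f} MB in {elapsed * 1e3:.0f} ms "
      f"({len(data) / 1e6 / elapsed:.0f} MB/s)")
```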

Can any disk I/O or OS expert offer an explanation of this?

Thanks, I did read the Wikipedia entry and understand the ‘lazy-loading’ part. One thing I failed to mention in my initial question is that I also profiled the timing of the initial streaming recognition (the first chunk of 320 ms of audio). I was expecting that if the model is read from disk as each location is accessed, there would be significant delays on the first computation through the graph. However, I did not see any additional lag.
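Concretely, the profiling looked roughly like this (a sketch; it assumes the newer streaming API with `createStream()`/`feedAudioContent()`/`intermediateDecode()`, and the model path is a placeholder):

```python
import time

import numpy as np
from deepspeech import Model

model = Model("output_graph.pbmm")  # placeholder path

# 320 ms of audio at 16 kHz, 16-bit mono: 0.32 * 16000 = 5120 samples.
# Silence is used here just to exercise the graph; timing the first pass
# works the same with real audio.
chunk = np.zeros(5120, dtype=np.int16)

stream = model.createStream()
start = time.perf_counter()
stream.feedAudioContent(chunk)         # first forward pass through the graph
partial = stream.intermediateDecode()  # force a decode of what we have so far
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"first 320 ms chunk took {elapsed_ms:.1f} ms, partial text: '{partial}'")
stream.finishStream()
```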

Is it the case that the actual ‘read’ of the model is much smaller than what the file size suggests (190 MB)?
I read about heap usage of only ~20 MB here: https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/ - maybe this explains part of it.
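One rough way to sanity-check that (a sketch; `ru_maxrss` is peak resident memory and its unit differs between Linux and macOS, so treat the number as approximate) is to see how much the process’s resident memory grows when the model is loaded. If only a fraction of the pages are actually touched, it should grow by far less than 190 MB:

```python
import resource

from deepspeech import Model

def rss_mb():
    # Peak resident set size; kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = rss_mb()
model = Model("output_graph.pbmm")  # placeholder path
after = rss_mb()
print(f"resident memory grew by roughly {after - before:.1f} MB for a ~190 MB model file")
```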

I’m asking these questions because I’m curious whether model loading can scale to tens or hundreds of models on a single machine without pre-loading them into memory. For example, if I have 50 custom-trained models and I’d like to serve all of them in real time (when a streaming request comes in), the best-case scenario is that there is no model-loading or initial lag. My testing so far suggests this is entirely possible; I just want to know if I’m missing something or if there are pitfalls.
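The serving pattern I have in mind is roughly this (a sketch; the model paths and the single-argument `Model` constructor are assumptions):

```python
from deepspeech import Model

# Hypothetical registry of custom-trained model files, all converted to the
# mmap-friendly format so that instantiation stays cheap.
MODEL_PATHS = {
    "customer_a": "models/customer_a.pbmm",
    "customer_b": "models/customer_b.pbmm",
    # ... up to 50 or more entries
}

_loaded = {}

def get_model(name):
    """Instantiate a model the first time it is requested and cache it."""
    if name not in _loaded:
        # If mmap behaves the way my measurements suggest, this costs tens of
        # milliseconds rather than a full 190 MB read, so on-demand loading
        # should be acceptable for real-time requests.
        _loaded[name] = Model(MODEL_PATHS[name])
    return _loaded[name]

# When a streaming request arrives:
# stream = get_model("customer_a").createStream()
```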

You should take a look at TensorFlow itself; we are relying on its capabilities to mmap efficiently, and this is why there is the convert_graphdef step, which adapts the protocol buffer file to be mmap-efficient.
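To illustrate the property being relied on, independent of TensorFlow (a toy sketch): mapping a file is essentially free, and bytes are only read from disk when the corresponding page is first touched.

```python
import mmap
import time

FILE_PATH = "output_graph.pbmm"  # placeholder: any large file works

with open(FILE_PATH, "rb") as f:
    start = time.perf_counter()
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(f"mmap of the whole file took {(time.perf_counter() - start) * 1e3:.3f} ms")

    start = time.perf_counter()
    _ = mm[0]             # first access faults the page in and triggers a real read
    _ = mm[len(mm) // 2]  # a distant offset faults in a different page
    print(f"touching two pages took {(time.perf_counter() - start) * 1e3:.3f} ms")

    mm.close()
```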

That depends on how much memory you have, I guess.