Fascinating post, Lukas, appreciate it. I have a couple of questions, if you don’t mind.
If you have an output_graph.pbmm (for a custom language model) that is 181M in size, do you think it might still fit within the 250M overall limit you mentioned?
Sorry if this is a general-purpose question about serverless, but how does the start-up time work in this scenario? When hosting our models via the traditional flask/docker method, we find that a good few seconds are required to load the model into memory, and then a few more seconds to do the actual transcription. So, obviously, we try to load it only once and then reuse it. Is that something that would happen ‘automatically’ in a serverless environment, or are you looking at a model-load-time cost on every inference (every transcription)?
How does serverless relate to GPUs? Can you use a serverless approach but specify an elastic GPU on the back end?
Sorry for all the newbie questions! Appreciate any of your thoughts on this though and thanks again for writing this up!
PS: I see that you can run things to keep your serverless lambdas ‘warm’, but I guess I'm wondering whether that ‘warmness’ actually carries over to the extent of keeping the TensorFlow model weights in memory somewhere…