Deploying a TFLite Model on GCP Serverless

Author: Murphy

Model deployment is tricky: with the continuously changing landscape of cloud platforms and AI-related libraries updating almost weekly, maintaining backward compatibility and finding the right deployment method is a big challenge. In today's blog post, we will see how to deploy a TFLite model on the Google Cloud Platform in a serverless fashion.

This blog post is structured in the following way:

  • Understanding Serverless and other ways of Deployment
  • What is Quantization and TFLite?
  • Deploying TFLite model using GCP Cloud Run API

Understanding Serverless and other ways of Deployment

Let's first understand what we mean by serverless, because serverless doesn't mean without a server.

An AI model, or any application for that matter, can be deployed in several different ways, which broadly fall into a few major categories.

Serverless: In this case, the model is stored in the cloud container registry and only runs when a user makes a request. When a request comes in, a server instance is automatically launched to fulfill it, and it shuts down again after a period of inactivity. Starting, configuring, scaling, and shutting down are all handled by the Cloud Run API provided by the Google Cloud Platform. AWS Lambda and Azure Functions are the alternatives on other clouds.
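
To make this concrete, here is a minimal sketch of what such a containerized inference service could look like, assuming a `model.tflite` file baked into the container image; the file name, route, and payload shape (`"inputs"` array) are illustrative choices, not part of any official template:

```python
# app.py - a minimal sketch of a TFLite inference service for Cloud Run.
# The model path, route, and JSON payload shape are illustrative assumptions.
import os

import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the interpreter once at startup so every request reuses it.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

@app.route("/predict", methods=["POST"])
def predict():
    # Expect {"inputs": [...]} matching the model's input tensor shape.
    data = np.array(request.get_json()["inputs"], dtype=np.float32)
    interpreter.set_tensor(input_details[0]["index"], data)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details[0]["index"])
    return jsonify({"outputs": output.tolist()})

if __name__ == "__main__":
    # Cloud Run injects the port to listen on via the PORT env variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

Once this is built into a container image, it can be deployed with the `gcloud run deploy` command; Cloud Run then handles scaling the instances up and down for us.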

Serverless has its own advantages and disadvantages.

  • The biggest advantage is cost saving: if you don't have a large user base, the server sits idle most of the time, and your money goes for no reason. Another advantage is that we don't need to think about scaling the infrastructure; depending on the load on the server, it can automatically replicate the number of instances and handle the traffic.
  • On the disadvantage side, there are three things to consider. It has a small payload limit, meaning it cannot be used to run bigger models. Secondly, the server automatically shuts down after about 15 minutes of idle time, so when we make a request after a long gap, the first request takes much longer than the consecutive ones; this is called the cold start problem (illustrated in the sketch after this list). And lastly, there are no proper GPU-based instances for serverless yet.
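
The cold start effect is easy to observe by simply timing consecutive requests. Below is a small illustrative sketch; the service URL and payload are placeholders for your own deployment:

```python
# measure_cold_start.py - a sketch that illustrates the cold start effect
# by timing consecutive requests. The URL and payload are placeholders.
import time

import requests

URL = "https://your-service-abc123-uc.a.run.app/predict"
payload = {"inputs": [[0.0, 0.0, 0.0, 0.0]]}  # shape depends on your model

for i in range(3):
    start = time.perf_counter()
    response = requests.post(URL, json=payload, timeout=60)
    elapsed = time.perf_counter() - start
    # The first request after idle time typically takes much longer,
    # because a new container instance must start and load the model.
    print(f"request {i + 1}: {response.status_code} in {elapsed:.2f}s")
```

If cold starts become a problem in practice, Cloud Run's `--min-instances` flag can keep a minimum number of instances warm, at the cost of paying for them while idle.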

Server instances: In this scheme, the server is always up, and you are always paying, even if no one is using your application. For applications with larger user bases, keeping the server up and running is important. Under this strategy, we can deploy our apps in multiple ways; one way is to launch a single server instance that you scale manually every time the traffic increases. In practice, these servers are launched with the help of Kubernetes clusters, which define the rules for scaling the infrastructure and handle traffic management for us.

  • The biggest advantage is that we can work with the largest models and applications and get precise control over our resources, from GPU-based instances to regular ones. But managing and scaling these server instances properly is quite a big task and often requires a lot of fiddling. GPU-based instances in particular can get very expensive, and many AI models require a GPU for faster inference.

Two great resources to understand Kubernetes and Docker:

Docker for dummies…

Tags: AI, Deployment, MLOps, Serverless, TensorFlow, TFLite
