Deploy and Serve AI Models (Part-1)

Rahul Thai Valappil
Published in CodeX
4 min read · Jul 28, 2021


Image source: Nvidia

Model creation is just one step toward building a real-world AI solution. AI models also need to be deployed, hosted, and served in order to run predictions, detections, and classifications on input data (generally, to run model inference) in real-world scenarios. Most AI applications follow the workflow below (a minimal code sketch of the whole pipeline follows the list):

  • Capture the input data required by the model, for example frames from a video source for object detection.
  • Pre-process the input data. Most models expect the input in a specific format or with specific dimensions, for example cropping the image to a fixed size, applying normalization, or permuting the dimensions of the data.
  • Run model inference on the input data for object detection, prediction, recognition, or classification.
  • Post-process the inference results. Sometimes you need to discard outputs below a confidence threshold or pick the most appropriate bounding boxes (Non-Max Suppression) for object detection results.
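To make this workflow concrete, here is a minimal sketch of such a pipeline in Python with PyTorch and torchvision. The model file name, the expected input size, the thresholds, and the assumption that the model returns boxes and scores directly are all illustrative placeholders, not part of any specific framework.

```python
import torch
from torchvision import transforms as T
from torchvision.ops import nms
from PIL import Image

# Hypothetical TorchScript detection model; path and output format are assumptions.
model = torch.jit.load("detector.pt").eval()

preprocess = T.Compose([
    T.Resize((640, 640)),          # resize to the size the model expects
    T.ToTensor(),                  # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

def detect(image_path, score_threshold=0.5, iou_threshold=0.45):
    # 1. Capture the input data
    image = Image.open(image_path).convert("RGB")

    # 2. Pre-process: resize, normalize, add a batch dimension
    batch = preprocess(image).unsqueeze(0)

    # 3. Run model inference (assumed to return boxes [N, 4] and scores [N])
    with torch.no_grad():
        boxes, scores = model(batch)

    # 4. Post-process: drop low-confidence boxes, then Non-Max Suppression
    keep = scores > score_threshold
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```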

Inference of AI Models

Deep learning inference is the process of running a trained DNN model to make detections, predictions, or classifications on previously unseen data. Deploying a trained DNN model to the right platform is therefore important, because inference demands significant computing and memory power. An AI application's performance and latency (the response time from when you feed data into the DNN until you receive a result) depend on the computing capacity of the system it is deployed on. So most of the time, trained AI models should be optimized (for example with pruning and quantization) to reduce the required computing power and latency.
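As one concrete illustration of such an optimization, the snippet below is a minimal sketch of post-training dynamic quantization in PyTorch; the stand-in model and the choice of layers to quantize are just examples, not a recommendation for any particular network.

```python
import torch
import torch.nn as nn

# Example model standing in for a trained network (assumption for illustration).
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Post-training dynamic quantization: weights of the listed layer types are
# stored as int8, shrinking the model and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller memory footprint
```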

Model inference can be done on-premise or in the cloud, on either CPUs or GPUs, and can be accessed from desktop, mobile, or web applications or from cloud services.

In the rest of this article, we will focus on model deployment and inferencing. The deployment strategy depends on the application type (desktop, web app, etc.), the mode (offline or online), and the real-time and latency requirements.

Offline Deployment of AI Models

This approach is better if your application has limited access to the internet or if network communication time is critical to your application's decision-making. For example, a fire detection system should raise an alarm from a camera feed in real time, with low latency. In this case, the AI models need to be shipped along with the application, and inference runs on-premise.

A standalone application that runs on a user's PC or on a single-board computer such as an NVIDIA Jetson Nano or a Raspberry Pi is the typical fit for this approach. The main drawback is that the host machine needs high computing capacity, such as a GPU, to speed up deep learning execution through parallelization.

Qt or Python is a suitable framework for building these kinds of applications if you are targeting multiple platforms. Qt is a powerful framework for building cross-platform standalone applications in C++ and also provides Python bindings for building applications in a Python environment. Qt with C++ is an apt choice if your application is performance-critical. Most ML frameworks, such as TensorFlow and PyTorch, provide APIs in both Python and C++.
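For instance, a standalone Python application can bundle the model file and run inference entirely on the local machine. The sketch below assumes a model already exported to ONNX; the file name, input shape, and availability of a GPU provider are hypothetical.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical ONNX model file shipped alongside the application.
# Prefer the GPU provider when available, fall back to CPU.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name

def predict(batch: np.ndarray) -> np.ndarray:
    """Run local (offline) inference; no network connection required."""
    return session.run(None, {input_name: batch})[0]

if __name__ == "__main__":
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example shape
    print(predict(dummy).shape)
```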

Online Deployment of AI Models

A trained AI model can be deployed on a cloud or on-premise server and accessed from the client application for model inference over a network connection. This approach suits all types of applications, but you need to account for network latency compared to an offline application. It gives client applications remote inferencing capability over HTTP/REST and gRPC, so users do not need heavy computing capacity such as a GPU, which reduces cost. Many service providers such as AWS, Azure, and GCP offer cloud facilities to deploy and host your ML models, but here I am focusing on inference servers.
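As a rough sketch of the REST path, a client can send a JSON request to a TensorFlow Serving predict endpoint as shown below; the host, port, model name, and input shape are placeholders for your own deployment.

```python
import requests

# Hypothetical TensorFlow Serving endpoint; host, port, and model name
# are placeholders for your own deployment.
URL = "http://localhost:8501/v1/models/my_model:predict"

def remote_predict(instances):
    """Send input data over HTTP/REST and return the model's predictions."""
    response = requests.post(URL, json={"instances": instances}, timeout=5.0)
    response.raise_for_status()
    return response.json()["predictions"]

if __name__ == "__main__":
    # A single example input; its shape depends on the served model.
    print(remote_predict([[0.1, 0.2, 0.3, 0.4]]))
```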

TensorFlow Serving and the NVIDIA Triton Inference Server are the most popular inference servers for hosting your models. The Triton server supports multiple frameworks, including TensorFlow, TensorRT, PyTorch, and ONNX Runtime, as well as custom framework backends, and it also supports custom Python backends for inference serving, which gives you the flexibility to choose a framework per project. Triton's high-performance inference lets you run models concurrently on GPUs to maximize utilization, supports CPU-based inferencing, and offers advanced features such as model ensembles and streaming inferencing.
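For example, Triton's Python HTTP client can be used roughly as sketched below. The server URL, model name, and the input/output tensor names and shapes are assumptions for illustration and must match your own model configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical Triton endpoint and model; names and shapes must match
# the deployed model's config.pbtxt.
client = httpclient.InferenceServerClient(url="localhost:8000")

def triton_predict(batch: np.ndarray) -> np.ndarray:
    inputs = [httpclient.InferInput("input__0", list(batch.shape), "FP32")]
    inputs[0].set_data_from_numpy(batch)
    outputs = [httpclient.InferRequestedOutput("output__0")]
    result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
    return result.as_numpy("output__0")

if __name__ == "__main__":
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    print(triton_predict(dummy).shape)
```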

https://developer.nvidia.com/sites/default/files/akamai/ai-for-enterprise-print-update-to-triton-diagram-1339418-final-r3.jpg
Image source: Nvidia

Triton Server Advantages

  1. Loads models from local storage or cloud platforms (see the repository layout sketch after this list)
  2. Updates models easily without restarting the server
  3. Runs multiple models from the same or different frameworks
  4. Supports real-time and batch inferencing
  5. Supports model ensembles
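Several of these points revolve around Triton's model repository convention. Below is a rough sketch of a typical repository layout and a minimal config.pbtxt; the model name, tensor names, and shapes are hypothetical and must match your own model.

```
model_repository/
└── my_model/                 # hypothetical model name
    ├── config.pbtxt          # model configuration
    ├── 1/                    # version 1
    │   └── model.onnx
    └── 2/                    # version 2; Triton can pick up new versions live
        └── model.onnx
```

```
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```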

Conclusion

This article covered only the basic deployment strategies for ML models when building applications for end users. In the coming parts of this series, I will focus on how to serve ML models with inference servers such as TensorFlow Serving and the NVIDIA Triton Inference Server, and how to accommodate pre- and post-processing logic without a separate server (ensemble models).
