
Scaling Up Deep Learning Model Serving Using OpenCV

According to one study, 80% of the models built by data scientists never make it to production. The reason is that the production environment imposes several constraints, such as inference time or, in some cases, the available hardware. Hence, to make a model production-ready, we first need to think carefully about which model we will deploy.

To choose the model, we run different experiments and weigh the trade-off between hardware requirements and accuracy. "Accuracy" here doesn’t mean accuracy alone; it stands for whatever metric is appropriate for the use case we are working on.

In this article, we will explore model optimization for CPU environments.

Benefits of Model Optimization in Terms of Business

Reduces Deployment Cost

By optimizing the model, we can run it efficiently with less memory and fewer computational resources, which lowers the cost of deploying it in production.

Model Optimization Boosts Your Earnings

Model optimization reduces the latency of the model, which means more requests can be served in the same amount of time. At the same deployment cost, you can therefore serve more users and earn more revenue.

Why Use OpenCV to Serve Our Model?

Because It’s Fast and Memory-Efficient

OpenCV is fast and memory-efficient. When running inference with OpenCV, memory consumption is often lower than with other frameworks, and inference is fast as well. Even models trained with the Darknet framework run faster with OpenCV, because the cv2.dnn module is optimized for inference on Intel CPUs.

OpenCV is Optimized for Intel CPUs

Since OpenCV was originally developed by Intel, it is optimized for inference on Intel CPUs.

In this article, we will optimize an SSD MobileNet model that has been trained on the COCO dataset.


Let’s Optimize Our Model

We need to perform the following steps in order to optimize our model.

  • Freezing and optimizing the model
  • Generating the pbtxt file for OpenCV prediction

Freezing and Optimizing the Model

Freezing converts the weights from variables into constants. Once the model is frozen, it can also be optimized.

Fortunately, the TensorFlow Object Detection API provides a single script that does both. It performs optimizations such as stripping unused and identity nodes and removing dropouts. A quantization option is also provided, but that type of optimization is not suited to CPUs, since they generally don’t support float16 operations, although this is not true of all CPUs.

To convert the model, we need to install the TensorFlow Object Detection API for TensorFlow 1.x and run the script with the following arguments.

pipeline_config_path: This is the path to the configuration file used for training the network.

trained_checkpoint_prefix: This is the path to the best checkpoint.

output_directory: The path where the optimized model will be stored.

The optimized model will be in protobuf (.pb) format.

Let’s say our trained model checkpoints and configuration are stored in the trained_checkpoints folder. We can then do the conversion using the following command.

python object_detection/ \
       --pipeline_config_path trained_checkpoints/mobilenetv1.config \
       --trained_checkpoint_prefix trained_checkpoints/model.ckpt \
       --output_directory trained_checkpoints/optimized_model.pb

Generating the pbtxt File for OpenCV Prediction

For TensorFlow models, the DNN module’s readNetFromTensorflow function expects both the protobuf (.pb) file, which contains the weights, and a configuration file in pbtxt format, which contains the topology of the model. These configuration files are called text graphs. The OpenCV repository has some helper code for writing text graphs.

In this case, since we chose an SSD model, we use the helper script for SSD models. A different script is needed for RCNN models.

This script expects three arguments:

input: This is the path to the optimized model.

config: This is the path to the configuration file used for training the model.

output: This is the path where the pbtxt file will be saved.

Wow, the Script Was Great but Where to Find This Amazing Script?

The script can be found as follows:

Go to:

In this folder of repo:


Let’s say our optimized model also resides in trained_checkpoints. We can then generate the pbtxt file using the following command:

python \
       --input trained_checkpoints/optimized_model.pb \
       --config trained_checkpoints/mobilenetv1.config \
       --output trained_checkpoints/model_conf.pbtxt

We Have Optimized Our Model. So What’s Next?

Now let’s raise the curtain and look at the magic happening behind the scenes.

Removal of Dropouts

Any deep learning practitioner who has trained a neural network is probably familiar with dropout. In some deep learning frameworks, such as TensorFlow, dropouts are implemented as layers. They randomly turn off a certain percentage of neurons during training, which helps prevent the model from overfitting the data.

During inference, however, they are not needed. If they remain in the neural network, they are never used yet still consume memory. Hence, in this step, we remove the dropouts and make our model more efficient.
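To see why a dropout layer can be removed safely, here is a minimal sketch of inverted dropout in plain Python. The function name `dropout` and its parameters are illustrative and not taken from any framework. It shows that at inference time the layer simply returns its input unchanged.

```python
import random


def dropout(activations, rate, training, rng=None):
    """Inverted dropout over a flat list of activations (illustrative only)."""
    if not training or rate == 0.0:
        # Inference: the layer is a pure pass-through, so removing it
        # from the graph does not change the model's output.
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    # Training: drop each unit with probability `rate` and rescale the
    # survivors so the expected activation stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


activations = [1.0, 2.0, 3.0, 4.0]
print("training: ", dropout(activations, rate=0.5, training=True))
print("inference:", dropout(activations, rate=0.5, training=False))
```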

Removal of Unused and Identity Nodes

In some cases, the model contains nodes that are never used. They only increase the memory and computation footprint of the model, so they have to be removed to optimize it. There are also nodes that simply pass their input through unchanged (identity nodes); these are redundant and can be removed as well.
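The idea can be sketched on a toy graph. This is not the code the conversion script runs; the dictionary-based graph format, the node names, and the `strip_graph` helper are all hypothetical. The sketch bypasses identity nodes and keeps only the nodes that the requested outputs actually depend on.

```python
def strip_graph(nodes, outputs):
    """Drop Identity nodes and nodes not reachable from `outputs`.

    `nodes` maps a node name to {"op": str, "inputs": [names]}.
    Assumes the output nodes themselves are not Identity nodes.
    """
    def resolve(name):
        # Follow chains of Identity nodes back to the real producer.
        while nodes[name]["op"] == "Identity":
            name = nodes[name]["inputs"][0]
        return name

    kept = {}
    stack = [resolve(name) for name in outputs]
    while stack:
        name = stack.pop()
        if name in kept:
            continue
        inputs = [resolve(i) for i in nodes[name]["inputs"]]
        kept[name] = {"op": nodes[name]["op"], "inputs": inputs}
        stack.extend(inputs)
    return kept


graph = {
    "image": {"op": "Placeholder", "inputs": []},
    "conv": {"op": "Conv2D", "inputs": ["image"]},
    "conv_copy": {"op": "Identity", "inputs": ["conv"]},
    "boxes": {"op": "Postprocess", "inputs": ["conv_copy"]},
    "summary": {"op": "Summary", "inputs": ["conv"]},  # never used for inference
}
stripped = strip_graph(graph, outputs=["boxes"])
print(sorted(stripped))
```

In this toy graph, `summary` is dropped because nothing needed for `boxes` depends on it, and `boxes` is rewired to read directly from `conv`.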

Conversion of Variables to Constants

During training, the weights are stored as variables, and they are updated by backpropagating the errors. Once training is done, the weights no longer need to change, so there is no reason to keep them as variables. Instead, they can be converted to constants, which consume less memory.
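As a rough illustration, the conversion can be pictured as baking the trained values into the graph itself. The graph format, the node names, and the `freeze` helper below are hypothetical and only mimic the idea; the real conversion is done by the TensorFlow tooling.

```python
def freeze(nodes, checkpoint):
    """Return a copy of `nodes` with every Variable turned into a Const."""
    frozen = {}
    for name, node in nodes.items():
        if node["op"] == "Variable":
            # Bake the trained value from the checkpoint into the graph.
            frozen[name] = {"op": "Const", "inputs": [], "value": checkpoint[name]}
        elif node["op"] == "Assign":
            # Update ops exist only to modify variables during training.
            continue
        else:
            frozen[name] = dict(node)
    return frozen


training_graph = {
    "image": {"op": "Placeholder", "inputs": []},
    "conv_w": {"op": "Variable", "inputs": []},
    "conv_w_update": {"op": "Assign", "inputs": ["conv_w"]},
    "conv": {"op": "Conv2D", "inputs": ["image", "conv_w"]},
}
checkpoint = {"conv_w": [[0.1, -0.2], [0.3, 0.4]]}

frozen_graph = freeze(training_graph, checkpoint)
print(frozen_graph["conv_w"])
```

After freezing, the graph is self-contained: the weights travel with it, and no separate checkpoint is needed at inference time.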


Pruning

During training, some weight values end up close to zero. The neurons corresponding to those weights contribute almost nothing to the output and are therefore redundant. By removing those neurons, we can drastically reduce the size of the network.
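A minimal sketch of this idea for one fully connected layer is shown below. The helper name `prune_neurons`, the threshold, and the weight values are made up for illustration, and the sketch assumes the layer has no bias terms. A neuron is removed when all of its incoming weights are near zero, and the matching column is removed from the next layer.

```python
def prune_neurons(layer_weights, next_layer_weights, threshold):
    """Remove neurons whose incoming weights are all below `threshold`.

    layer_weights[j]         -> incoming weights of neuron j
    next_layer_weights[k][j] -> weight from neuron j to next-layer neuron k
    """
    keep = [j for j, row in enumerate(layer_weights)
            if max(abs(w) for w in row) >= threshold]
    pruned_layer = [layer_weights[j] for j in keep]
    # The next layer must drop the inputs that came from removed neurons.
    pruned_next = [[row[j] for j in keep] for row in next_layer_weights]
    return pruned_layer, pruned_next


layer = [[0.5, -0.3], [0.001, -0.002], [0.2, 0.0]]  # 3 neurons, 2 inputs each
next_layer = [[1.0, 2.0, 3.0]]                      # 1 neuron, 3 inputs

pruned_layer, pruned_next = prune_neurons(layer, next_layer, threshold=0.01)
print(len(layer), "->", len(pruned_layer), "neurons")
```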


Quantization

In quantization, we cast the weights of the neural network to smaller data types, for example from float32 to float16 or int8. This reduces the size of the neural network and speeds up computation, provided these data types are supported on the target hardware. Quantization is hardware-specific: some hardware supports both float32 and float16 operations, while other hardware does not.
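The size reduction can be sketched with the Python standard library alone. The helper names and the sample weights below are illustrative. The float16 part uses the `struct` module's half-precision format, and the int8 part uses a simple symmetric scheme, which is only one of several possible quantization schemes.

```python
import struct


def pack_float32(weights):
    """Serialize weights as 4-byte floats."""
    return struct.pack(f"<{len(weights)}f", *weights)


def pack_float16(weights):
    """Serialize weights as 2-byte half-precision floats."""
    return struct.pack(f"<{len(weights)}e", *weights)


def unpack_float16(data):
    return list(struct.unpack(f"<{len(data) // 2}e", data))


def quantize_int8(weights):
    """Symmetric int8 quantization: w is approximated by q * scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    return [round(w / scale) for w in weights], scale


weights = [0.5, -0.1, 0.25, 1.0]
print("float32 bytes:", len(pack_float32(weights)))
print("float16 bytes:", len(pack_float16(weights)))
print("float16 round trip:", unpack_float16(pack_float16(weights)))

q, scale = quantize_int8(weights)
print("int8 values:", q, "scale:", scale)
```

The round trip shows the price of the smaller type: values that are not exactly representable, such as -0.1, come back slightly changed.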


Serving the Model

The models can be served with the OpenCV DNN module. The model is loaded with the readNetFromTensorflow function, which expects the path to the optimized model as its first argument and the path to the configuration (pbtxt) file as its second argument.
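A minimal serving sketch might look like the following. It assumes the file names from the commands earlier in this article, a hypothetical test image `street.jpg`, a 300x300 input size, and the usual SSD output layout in OpenCV, where each detection row is `[batch_id, class_id, score, left, top, right, bottom]` with normalized coordinates. The helper names are illustrative.

```python
def load_network(model_path, config_path):
    """Load the frozen graph and its text graph with OpenCV's DNN module."""
    import cv2  # imported lazily so the parsing helper works without OpenCV
    return cv2.dnn.readNetFromTensorflow(model_path, config_path)


def parse_detections(detections, width, height, conf_threshold=0.5):
    """Convert raw SSD output (shape 1 x 1 x N x 7) to pixel-space boxes."""
    results = []
    for row in detections[0][0]:
        _, class_id, score, left, top, right, bottom = row
        if float(score) < conf_threshold:
            continue
        results.append({
            "class_id": int(class_id),
            "score": float(score),
            "box": (int(round(left * width)), int(round(top * height)),
                    int(round(right * width)), int(round(bottom * height))),
        })
    return results


def detect(net, image, conf_threshold=0.5):
    """Run one forward pass on a BGR image and return filtered detections."""
    import cv2
    # 300x300 is an assumed input size; use the size from your pipeline config.
    blob = cv2.dnn.blobFromImage(image, size=(300, 300), swapRB=True, crop=False)
    net.setInput(blob)
    height, width = image.shape[:2]
    return parse_detections(net.forward(), width, height, conf_threshold)


# Example usage (paths follow the commands above; the image is hypothetical):
# import cv2
# net = load_network("trained_checkpoints/optimized_model.pb",
#                    "trained_checkpoints/model_conf.pbtxt")
# print(detect(net, cv2.imread("street.jpg")))
```

Loading the network once and reusing it for every request keeps the per-request cost down to the forward pass itself.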



Conclusion

In this article, we explored how simple model optimization can give large returns in the long run. For instance, the speed gain we achieved using OpenCV for our detection model can make a big difference in object tracking, where we want to track objects in real time. Further hardware-specific optimizations can be applied for different hardware to increase efficiency and performance even more.
