In this post, we present the results of a model serving experiment conducted by Contentsquare scientists with an innovative deep learning (DL) model trained to analyze HTML documents. We show how the Amazon SageMaker TensorFlow Serving solution helped Contentsquare address several serving challenges.
Contentsquare is a fast-growing French technology company empowering brands to build better digital experiences. In their own words, “Our experience analytics platform tracks and visualizes billions of digital behaviors, delivering intelligent recommendations that everyone can use to grow revenue, increase loyalty, and fuel innovation.”
Contentsquare scientists developed several ML and deep learning models, and wanted to find solutions for cost-effective and performant real-time model serving. For this experiment, they chose a custom multi-input, multi-task deep neural network developed with TensorFlow-backed Keras, which can answer several questions in one single inference on large payloads consisting of HTML pages.
As a baseline deployment, the Contentsquare team served the TensorFlow-backed Keras model from a Flask server hosted on an Amazon Elastic Compute Cloud (Amazon EC2) p2.xlarge GPU machine. Flask is a popular Python web framework (52,000 stars on GitHub) appreciated for its simplicity and large community. The EC2 p2.xlarge instance was fitted with an NVIDIA Tesla K80 GPU card. On a reference input payload used as a benchmark, this design provided a single-request inference latency of approximately 5 seconds.
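As a rough illustration of how a single-request latency figure like this one can be collected, the following minimal sketch times sequential calls to a server. The measure_latency helper and the fake_endpoint stand-in are hypothetical names introduced here; the post does not describe the actual benchmarking harness used by Contentsquare.

```python
import statistics
import time


def measure_latency(send_request, payload, repeats=5):
    """Return per-request latencies (seconds) for sequential single requests."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        send_request(payload)
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_endpoint(payload):
    # Hypothetical stand-in for an HTTP call to the model server.
    time.sleep(0.01)
    return {"predictions": []}


latencies = measure_latency(fake_endpoint, "<html></html>")
print(f"median latency: {statistics.median(latencies):.3f} s")
```

In practice, send_request would wrap the real client call, and the same payload would be reused across deployments so that the numbers stay comparable.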
To reduce management overhead and get a simpler deployment experience, the Contentsquare team experimented with Amazon SageMaker. SageMaker is a managed service supporting the development lifecycle of custom models, from annotation up to production deployment and monitoring. Beyond enabling a faster time to market, SageMaker provides state-of-the-art, open-source, pre-written serving containers and SDK integrations for XGBoost, Scikit-Learn, PyTorch, TensorFlow, and Apache MXNet. In particular, the SageMaker TensorFlow serving container is built on top of TensorFlow Serving (TensorFlow-Serving: Flexible, High-Performance ML Serving, Olston et al.), the official, high-performance serving stack for TensorFlow. The SageMaker team further improved the TensorFlow Serving experience by adding the option to run custom inference code in front of TensorFlow Serving (for example, for preprocessing or postprocessing).
Slim Frikha, a Contentsquare scientist, says, “That is one of the reasons why we use TensorFlow Serving on SageMaker: TensorFlow Serving runs performant inference, SageMaker provides easy deployment, and the combination of both brings the extra possibility to do preprocessing and postprocessing with TensorFlow Serving.”
Preprocessing and postprocessing are important capabilities that ML practitioners look for when choosing an ML serving solution. To use the custom processing capacity of SageMaker TensorFlow Serving, developers can provide a custom inference.py script containing handling functions. For more information, see Create Python Scripts for Custom Input and Output Formats.
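To make the inference.py script described above more concrete, here is a minimal sketch of the input_handler and output_handler functions that the SageMaker TensorFlow Serving container looks for. The _html_to_features helper and the text/html content type are assumptions made for illustration only; the post does not describe Contentsquare's actual preprocessing.

```python
import json


def _html_to_features(html):
    # Hypothetical preprocessing: a stand-in for real feature extraction.
    return [float(html.count("<")), float(len(html))]


def input_handler(data, context):
    """Convert the raw request body into a TensorFlow Serving REST request."""
    if context.request_content_type != "text/html":
        raise ValueError(
            f"Unsupported content type: {context.request_content_type}"
        )
    html = data.read().decode("utf-8")
    return json.dumps({"instances": [_html_to_features(html)]})


def output_handler(response, context):
    """Return the TensorFlow Serving response to the client unchanged."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header
```

With this two-function form, the container forwards the value returned by input_handler to TensorFlow Serving over REST and passes the response to output_handler. Any postprocessing, such as mapping scores to labels, would go in output_handler before returning.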
The following figures show a high-level view of the internal architecture of the current SageMaker TensorFlow Serving container. Two web servers are collocated in each instance of the endpoint instance fleet. An NGINX server handles the communication with the requesting client and can optionally run ad hoc data processing via an inference.py script running in Gunicorn. A TensorFlow Serving server internally exposes TensorFlow models for consumption by the Gunicorn server. In-server communication between Gunicorn and TensorFlow Serving can use REST or gRPC when an inference.py custom inference script is provided, and uses REST in the default setup without the custom inference script. In both cases, external requests are made over REST.
The Contentsquare team tested both gRPC and HTTP for internal communication with TensorFlow Serving, and found gRPC to be much faster than HTTP, because HTTP required a JSON dump of the very large preprocessed input. On the specific benchmark inference payload, deploying in SageMaker TensorFlow Serving on an ml.p2.xlarge hosting instance reduced the global serving latency from 5 seconds to 3 seconds, compared to Keras deployed in Flask on an Amazon EC2 p2.xlarge instance. That is a 40% improvement: (5 - 3) / 5 = 0.4. This gain is driven by serving optimizations internal to TensorFlow Serving and by the decoding of inputs into TensorFlow tensors, which can be faster when using gRPC.
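One way to see why a JSON dump of a large preprocessed input is costly is to compare the size of the same values encoded as JSON text and as packed binary floats. The sketch below uses only the Python standard library and does not issue any gRPC call. The 100,000-value feature vector is an arbitrary assumption, and the packed float32 encoding is only a proxy for the binary tensor content that a protocol buffer message can carry.

```python
import json
import random
import struct


def encoded_sizes(values):
    """Return (json_bytes, packed_float32_bytes) for a list of floats."""
    json_size = len(json.dumps({"instances": [values]}).encode("utf-8"))
    packed_size = len(struct.pack(f"<{len(values)}f", *values))
    return json_size, packed_size


random.seed(0)
features = [random.random() for _ in range(100_000)]  # hypothetical input size
json_size, packed_size = encoded_sizes(features)
print(f"JSON: {json_size} bytes, packed float32: {packed_size} bytes")
```

Beyond raw size, the JSON path also requires serializing the values to text on one side and parsing them back on the other, which adds CPU time on large payloads.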
Contentsquare scientists successfully completed their benchmark and found a cost-effective, high-performance serving solution for their custom TensorFlow model that reduced latency by 40% compared with a reasonable baseline. Another axis of improvement, not evaluated in this benchmark but worth considering for extra gains, would be to evaluate different instance types. For example, the EC2 G4 instances, more recent than the P2, demonstrated great performance and economics in several inference cases. If you are interested in learning more about TensorFlow Serving on SageMaker, you can find guidance in the documentation, view the container source code on GitHub, and browse our examples gallery.
Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in Lyon, France. Olivier helps French customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.