Deploying deep learning models in production is challenging, especially in a headless environment. Obstacles like scalability, latency, and resource management can slow a retailer’s effort to deliver seamless AI-powered shopping experiences. At UpStart Commerce, we’ve addressed these challenges head-on. By developing our UC Visual Search APIs, we’ve created scalable and easy-to-use headless APIs that simplify the complexities of deploying deep learning models while ensuring exceptional performance.
With the rise of mobile devices, ecommerce users interact with online stores in various ways. Features that improve the overall shopping experience, like visual search, are essential for online shoppers. Serving PyTorch models such as the Faster R-CNN object detection model in production introduces complexities in scalability, latency, and resource management.
So, how can you serve your PyTorch models to handle growing user demand without sacrificing performance? The answer starts with choosing a model-serving solution that is built to scale.
This article discusses the need for scalable PyTorch model serving and demonstrates how TorchServe effectively addresses these demands, using the Faster R-CNN model as a real-world example. We’ll also compare the deployment of PyTorch models using TorchServe and FastAPI, sharing insights from our evaluation in the Comparing TorchServe vs FastAPI Model Serving Options section.
What is TorchServe?
TorchServe is an open-source model serving framework specifically designed for PyTorch models. Developed collaboratively by PyTorch and AWS, it simplifies the process of deploying trained models without requiring custom code for model loading and inference. By handling the underlying complexities, TorchServe provides a flexible and extensible platform that supports CPU and GPU environments, making it suitable for various production scenarios. Key features include easy model deployment through a streamlined workflow, support for multi-model serving, model versioning, and built-in tools for monitoring and logging. These features are essential for tracking model performance and resource utilization. By leveraging TorchServe, developers can focus on building high-quality models while relying on a robust serving infrastructure that ensures scalability, efficiency, and reliability in production environments.
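For instance, the built-in monitoring mentioned above is exposed through an HTTP metrics endpoint (Prometheus text format, served on port 8082 by default), which you can scrape or inspect directly. A minimal check from Python, assuming TorchServe is running locally with the default ports:

import requests

# TorchServe publishes metrics in Prometheus text format on port 8082 by default.
metrics = requests.get("http://127.0.0.1:8082/metrics").text
print(metrics[:500])  # print the first few metric lines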
TorchServe Workflow
Image source: https://aws.amazon.com/pytorch/
TorchServe simplifies the deployment of trained PyTorch models by wrapping them in RESTful APIs, enabling scalable and low-latency model serving. The key steps in the deployment workflow are:
- Model Training: Develop and train your models in PyTorch for tasks such as image classification, object detection, or natural language processing.
- Model Packaging: Use the Torch Model Archiver to package the trained model, its associated weights, and optional custom handlers into a single .mar file.
- Model Serving: Deploy the packaged model using TorchServe, which efficiently handles inference requests with built-in handlers or custom logic for specialized use cases.
- Model Versioning and A/B Testing: TorchServe supports simultaneously serving multiple model versions, facilitating A/B testing and smooth transitions between model updates in production.
- Client Interaction: Client applications interact with TorchServe by sending REST API requests and receiving real-time predictions, seamlessly integrating model inference into their workflows.
For detailed information, go to the TorchServe documentation.
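As a quick illustration of the client-interaction step above, here is a minimal Python sketch that sends an image to a locally running TorchServe instance. It assumes the default inference port (8080) and a model registered under the name model, matching the deployment example later in this article:

import requests

# Default TorchServe inference endpoint; the model name "model" matches the
# deployment example later in this article.
URL = "http://127.0.0.1:8080/predictions/model"

with open("kitten.jpg", "rb") as f:  # any test image will do
    response = requests.post(URL, data=f)

response.raise_for_status()
print(response.json())  # e.g. a list of {"class": ..., "box": ..., "score": ...} entries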
TorchServe Architecture
The architecture of TorchServe is designed to manage model serving at scale efficiently:
- Frontend: Handles incoming client requests and responses, and manages the lifecycle of models, including loading and unloading.
- Model Workers: Dedicated processes responsible for running model inference. Each worker handles requests for a specific model and can utilize CPU or GPU resources as configured.
- Model Store: A repository where all loadable models (.mar files) are stored. TorchServe loads models from this location upon request.
- Plugins: Extend TorchServe’s functionality with custom components such as authentication mechanisms, custom endpoints, or specialized batching algorithms.
By abstracting these components, TorchServe provides a robust and flexible platform tailored to various deployment needs while maintaining high performance and scalability.
Setting Up Your Environment for TorchServe and Detectron2
To set up the environment for serving models with TorchServe and working with the custom model handler for Detectron2, follow these detailed steps:
Step-by-Step Guide
- Install Python and Required Tools: Ensure Python 3.8 or higher is installed.
- Clone the TorchServe Repository:
git clone https://github.com/pytorch/serve.git
- Navigate to the Example Directory:
cd serve/examples/object_detector/detectron2
- Install Dependencies:
- Install all required Python packages by running:
pip install -r requirements.txt
- Install Detectron2 and a compatible version of NumPy:
pip install git+https://github.com/facebookresearch/detectron2.git && pip install numpy==1.21.6
- Verify Installation:
- Run the following command to check if TorchServe is installed and accessible:
torchserve --help
- Verify the installation of Detectron2 by importing it in a Python shell:
import detectron2
print("Detectron2 is installed correctly!")
Refer to the complete README file.
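If you intend to serve on GPU, it is also worth confirming that PyTorch can see your CUDA device before moving on. A quick check, assuming torch and Detectron2 are installed as above (the handler described below selects CPU or GPU based on this):

import torch
import detectron2

print("PyTorch:", torch.__version__)
print("Detectron2:", detectron2.__version__)
print("CUDA available:", torch.cuda.is_available())  # if False, the handler falls back to CPU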
Deep Dive into the Model Handler Code
Here is a brief overview of each step:
- Model Initialization: The ModelHandler loads the Detectron2 model once during startup, checking for GPU availability to set the device (CPU or GPU) accordingly. It initializes the model with the specified weights and configuration, preparing the DefaultPredictor for efficient inference without reloading it for each request.
- Preprocessing Input Data: For each incoming request, the handler validates the image data, reads it into a BytesIO stream, and uses PIL to convert it to RGB format. It then transforms the image into a NumPy array and converts it to BGR format, ensuring it’s correctly formatted for the model.
- Performing Inference: The handler uses the pre-initialized predictor to perform inference on the preprocessed images. It processes each image to obtain predictions like classes, bounding boxes, and scores, utilizing the model efficiently without additional overhead.
- Postprocessing Output Data: It extracts relevant prediction data from the inference outputs, such as classes and bounding boxes, and formats them into a dictionary. The handler then serializes this dictionary into a JSON string, providing a client-friendly response.
- Handling Requests: The handle method orchestrates the entire process by ensuring the model is initialized and then sequentially executing the preprocessing, inference, and postprocessing steps. It handles errors gracefully and returns the final predictions to the client.
You can find the code for the ModelHandler here: Feature/Add TorchServe Detectron2 on GitHub.
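To make that flow concrete, here is a condensed, illustrative sketch of a handler with this structure. It is not the exact code from the linked repository: the configuration and weights are pulled from the Detectron2 model zoo as placeholders, and error handling is omitted for brevity.

# Condensed sketch of a TorchServe custom handler for Detectron2.
import io
import json

import numpy as np
import torch
from PIL import Image
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor


class ModelHandler:
    def __init__(self):
        self.predictor = None
        self.initialized = False

    def initialize(self, context):
        # Load the Detectron2 model once at startup, on GPU if available.
        # In the real handler, context.system_properties provides the model
        # directory and GPU id; here we use model-zoo placeholders instead.
        cfg = get_cfg()
        cfg.merge_from_file(model_zoo.get_config_file(
            "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))  # placeholder config
        cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
            "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")   # placeholder weights
        cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
        self.predictor = DefaultPredictor(cfg)
        self.initialized = True

    def preprocess(self, batch):
        # Decode each request payload into a BGR NumPy array for Detectron2.
        images = []
        for row in batch:
            data = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(data)).convert("RGB")
            images.append(np.array(image)[:, :, ::-1])  # RGB -> BGR
        return images

    def inference(self, images):
        # Run the pre-initialized predictor on each preprocessed image.
        return [self.predictor(image) for image in images]

    def postprocess(self, outputs):
        # Extract classes, boxes, and scores and serialize them as JSON.
        results = []
        for output in outputs:
            instances = output["instances"].to("cpu")
            results.append(json.dumps({
                "classes": instances.pred_classes.tolist(),
                "boxes": instances.pred_boxes.tensor.tolist(),
                "scores": instances.scores.tolist(),
            }))
        return results

    def handle(self, data, context):
        if not self.initialized:
            self.initialize(context)
        return self.postprocess(self.inference(self.preprocess(data)))


_service = ModelHandler()


def handle(data, context):
    # TorchServe entry point: one response per request in the batch.
    if data is None:
        return None
    return _service.handle(data, context)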
Managing & Scaling TorchServe Models
To manage dependencies effectively, it is best to use Conda or another virtual environment tailored to your project’s requirements. Then, install TorchServe following the instructions provided on GitHub. If you’ve already completed the installation as outlined in the README file, you can skip this step.
Steps to Deploy Your Model with TorchServe
- Download the TorchServe Repository:
- Access the TorchServe examples by running the following commands:
mkdir torchserve-examples
cd torchserve-examples
git clone https://github.com/pytorch/serve.git
- Download a Pre-Trained Model:
- Download a Faster R-CNN object detection model by running:
wget https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
- Convert the Model to TorchServe Format:
- Use the torch-model-archiver tool to package the model into a .mar file. Rename the downloaded weights to model.pth (or point --serialized-file at the downloaded filename), then run the following command:
torch-model-archiver --model-name model --version 1.0 --serialized-file model.pth --extra-files config.yaml --handler serve/examples/object_detector/detectron2/detectron2-handler.py -f
ls *.mar
- Host the Model:
- Move the .mar file to a model store directory and start TorchServe:
mkdir model_store
mv model.mar model_store/
torchserve --start --model-store model_store --models model=model.mar --disable-token-auth
- Test the Model with TorchServe:
- Test the hosted model by opening another terminal on the same host and using the following commands (you can use tmux to manage multiple sessions):
curl -O https://s3.amazonaws.com/model-server/inputs/kitten.jpg
curl -X POST http://127.0.0.1:8080/predictions/model -T kitten.jpg
The response should look similar to:
[
{
"class": "tiger_cat",
"box": [34.5, 23.1, 100.4, 200.2],
"score": 0.4693356156349182
},
{
"class": "tabby",
"box": [12.0, 50.1, 90.2, 180.3],
"score": 0.46338796615600586
}
]
- List Registered Models:
- Query the list of models hosted by TorchServe:
curl "http://localhost:8081/models"
{
"models": [
{
"modelName": "model",
"modelUrl": "model.mar"
}
]
}
- Scale Model Workers:
- A newly registered model may have no workers assigned to it, so set a minimum number of workers with the following commands (a Python version of these management calls appears after these steps):
curl -v -X PUT "http://localhost:8081/models/model?min_worker=2"
curl "http://localhost:8081/models/model"
- Unregister a Model:
- Remove the model from TorchServe:
curl -X DELETE http://localhost:8081/models/model/
- Version a Model:
- To version a model, pass a version number to the --version flag when calling torch-model-archiver:
torch-model-archiver --model-name model --version 1.0 ...
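The same management calls shown above with curl can also be scripted. Below is a minimal Python sketch, assuming the default management port (8081) and the model name model used in this example, that lists registered models, scales workers, and checks the model’s status:

import requests

MANAGEMENT_URL = "http://localhost:8081"  # default TorchServe management API port
MODEL_NAME = "model"                      # model name used in this example

# List all registered models.
print(requests.get(f"{MANAGEMENT_URL}/models").json())

# Ask TorchServe to keep at least two workers for the model.
requests.put(f"{MANAGEMENT_URL}/models/{MODEL_NAME}", params={"min_worker": 2}).raise_for_status()

# Inspect the model's current status, including its workers.
print(requests.get(f"{MANAGEMENT_URL}/models/{MODEL_NAME}").json())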
Comparing TorchServe vs FastAPI Model Serving Options
When deploying PyTorch models to production, choosing the right serving framework is critical for optimal performance, scalability, and efficient resource management. At UpStart Commerce, we compared two common approaches: TorchServe and FastAPI. This evaluation provides insights based on data collected under various configurations and workloads.
Methodology
We conducted a series of performance tests to measure and compare the latency and throughput of TorchServe and FastAPI when serving the same PyTorch model. The tests varied in terms of:
- Number of Workers: The number of worker processes handling inference requests. For TorchServe, this is configurable per model.
- Parallel Requests: The number of concurrent requests sent to the server, simulating different levels of user demand.
- Total Requests: The total number of requests sent during each test to ensure statistical significance.
We recorded the following metrics for analysis:
- Minimum Time: The shortest time taken to process a request.
- Maximum Time: The longest time taken to process a request.
- Average (Mean): The average time taken across all requests.
- Median: The middle value in the list of recorded times, providing a measure of central tendency less affected by outliers.
- Mode: The most frequently occurring response time in the dataset.
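For reference, the sketch below shows one way such a benchmark can be implemented. It is an illustrative reconstruction, not the exact script we used; the endpoint, test image, and concurrency settings are placeholders.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8080/predictions/model"  # placeholder inference endpoint
IMAGE_PATH = "kitten.jpg"                        # placeholder test image
PARALLEL_REQUESTS = 20
TOTAL_REQUESTS = 500

with open(IMAGE_PATH, "rb") as f:
    payload = f.read()

def timed_request(_):
    # Send one inference request and return its latency in milliseconds.
    start = time.perf_counter()
    requests.post(URL, data=payload).raise_for_status()
    return (time.perf_counter() - start) * 1000

with ThreadPoolExecutor(max_workers=PARALLEL_REQUESTS) as pool:
    latencies = list(pool.map(timed_request, range(TOTAL_REQUESTS)))

print(f"min:    {min(latencies):.3f} ms")
print(f"max:    {max(latencies):.3f} ms")
print(f"mean:   {statistics.mean(latencies):.3f} ms")
print(f"median: {statistics.median(latencies):.3f} ms")
# Bucket latencies to 0.1 ms so a meaningful mode exists.
print(f"mode:   {statistics.mode(round(x, 1) for x in latencies):.3f} ms")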
Results and Analysis
Scenario 1: Single Worker, Single Request
Configuration:
- Workers: 1
- Parallel Requests: 1
- Total Requests: 500
Performance Metrics:
| Approach   | Min Time | Max Time  | Average  | Median   | Mode     |
|------------|----------|-----------|----------|----------|----------|
| TorchServe | 35.342ms | 55.192ms  | 38.569ms | 38.082ms | 36.510ms |
| FastAPI    | 29.045ms | 1276.5ms  | 34.822ms | 31.326ms | 33.128ms |
Analysis:
- FastAPI had a slightly lower minimum and average response time, indicating faster individual request handling in this minimal load scenario.
- TorchServe showed more consistent performance with a smaller gap between minimum and maximum times, suggesting better reliability.
- FastAPI exhibited significant variability with a high maximum time, possibly due to occasional latency spikes.
Scenario 2: Single Worker, Moderate Concurrency
Configuration:
- Workers: 1
- Parallel Requests: 20
- Total Requests: 500
Diagrams illustrating the performance metrics for Scenario 2:
Analysis:
- FastAPI outperformed TorchServe in average and median response times under moderate concurrency with a single worker.
- The lower minimum time for FastAPI indicates quicker response for some requests, but the maximum times are closer, suggesting that both frameworks experience delays under increased load.
- TorchServe’s higher response times highlight the need for multiple workers to handle concurrency effectively.
Scenario 3: Multiple Workers, Increasing Concurrency
Configuration:
- Workers: 5
- Parallel Requests: 1, 20, 50, 100, 500
- Total Requests: 500
Diagrams illustrating the performance metrics for Scenario 3:
Analysis:
- TorchServe:
- Demonstrated increased average response times with higher concurrency, as expected.
- Maintained relatively stable maximum times compared to FastAPI, indicating better handling of high concurrency when multiple workers are used.
- Surprisingly, at 500 parallel requests, TorchServe’s average response time decreased significantly. This could be due to efficient batching or internal optimizations kicking in under extreme load.
- FastAPI:
- Showed lower minimum times across all concurrency levels, highlighting its ability to quickly handle individual requests.
- Experienced significant increases in maximum and average response times at higher concurrency levels, indicating potential performance bottlenecks.
- At 500 parallel requests, FastAPI’s average response time exceeded 7 seconds, which is unacceptable for real-time applications.
Key Observations
- Consistency: TorchServe provided more consistent response times, especially under high load, due to its optimized request handling and worker management.
- Scalability: TorchServe scaled better with increased concurrency when configured with multiple workers. FastAPI’s performance degraded significantly at higher concurrency levels.
- Resource Utilization: TorchServe’s ability to handle more requests with fewer resources can lead to cost savings in production environments.
- FastAPI’s Strengths: At low concurrency, FastAPI was competitive and sometimes faster in terms of minimum response time, making it suitable for applications with low to moderate traffic.
Recommendations
- High-Concurrency Applications: Use TorchServe when expecting high levels of concurrent requests. Its architecture is better suited to scale horizontally with multiple workers.
- Low to Moderate Traffic: Consider FastAPI for applications with predictable, low to moderate traffic, where its simplicity and flexibility can be advantageous.
- Worker Configuration: Properly configure the number of workers in TorchServe based on expected load to optimize performance.
- Monitoring and Tuning: Continuously monitor performance metrics in production and adjust configurations as necessary for both TorchServe and FastAPI.
Our evaluation indicates that TorchServe is generally more robust and scalable for serving PyTorch models in production environments, especially under high concurrency. TorchServe’s architecture is optimized to handle multiple concurrent requests efficiently, providing consistent and reliable performance.
On the other hand, FastAPI offers simplicity and speed for applications with lower concurrency demands. FastAPI is a good choice when ease of development and integration with other Python web applications is a priority.
Ultimately, the choice between TorchServe and FastAPI should depend on your application’s specific requirements, such as traffic patterns, resource availability, and performance goals.