# Practical Performance Guarantees for Pipelined DNN Inference

2311.03703

2

0

🚀

## Abstract

We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner's dilemma of deciding when a solution is good enough. To this end, we design novel mixed-integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for $k in {2, 4, 8, 16, 32, 64}$, we empirically show that these lower bounds are strong enough to be useful in practice. Our lower bounds are substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across our production testbed with $k = 16$ pipeline stages, our MIP formulations raised the lower bound from 0.4598 to 0.9452, expressed as a fraction of the best partition found. In other words, our improved lower bounds closed the optimality gap by a factor of 9.855x.

Get summaries of the top AI research delivered straight to your inbox:

## Overview

- The researchers present practical and effective algorithms to optimize pipeline parallelism for deep neural network (DNN) inference.
- They partition DNN models into k stages and minimize the running time of the bottleneck stage, including communication.
- The researchers focus on helping practitioners determine when a solution is good enough, designing novel mixed-integer programming (MIP) relaxations to prove lower bounds.
- They evaluate their methods on a diverse testbed of 369 production models with k = 2, 4, 8, 16, 32, and 64 pipeline stages.

## Plain English Explanation

The researchers have developed a way to make deep learning models run more efficiently on multiple processors. When running a deep learning model, the model is often split into different stages that can be processed in parallel. This is called pipeline parallelism. The researchers' goal is to partition the model into the optimal number of stages (k) to minimize the total processing time, including the time needed to communicate between stages.

This is a challenging optimization problem, so the researchers have designed new algorithms to find good solutions. Importantly, they focus on helping practitioners - the people actually using these models in real-world applications - decide when a solution is "good enough" and doesn't need further optimization.

To do this, the researchers develop novel mathematical programming techniques to calculate lower bounds on the optimal solution. These lower bounds help practitioners understand how close their current solution is to the best possible outcome. The researchers show that their new lower bound calculations are much tighter (i.e. closer to the optimal solution) than standard approaches, reducing the "optimality gap" by almost 10 times on average.

## Technical Explanation

The core of the researchers' approach is to partition the DNN model graph into k stages and minimize the running time of the bottleneck stage, including communication overhead. This is a challenging NP-hard problem that the researchers tackle with practical and effective algorithms.

A key innovation is the design of novel mixed-integer programming (MIP) relaxations to prove tight lower bounds on the optimal solution. These bounds help practitioners understand how close their current solution is to the best possible partition.

The researchers evaluate their methods on a diverse testbed of 369 production DNN models, experimenting with k = 2, 4, 8, 16, 32, and 64 pipeline stages. They find that their MIP formulations substantially improve upon standard combinatorial lower bounds. For example, with k = 16 stages, the lower bound is raised from 0.4598 to 0.9452 as a fraction of the best partition found - a 9.855x improvement in the optimality gap.

This work builds on prior research in performance modeling for machine learning training and resource-aware DNN deployment, showing how careful algorithmic techniques can make DNN inference more efficient in practical settings.

## Critical Analysis

The researchers acknowledge several limitations and areas for future work. First, their algorithms assume a static DNN model graph, whereas in practice models may be dynamically changing. Extensions to handle dynamic models would be valuable.

Additionally, the researchers focus only on minimizing the running time of the pipeline bottleneck. Other objectives, such as fairness across pipeline stages or energy efficiency, could also be important in real-world deployments. Exploring multi-objective optimization would be an interesting direction.

The testbed used in the experiments, while diverse, is limited to 369 production models. Evaluating the techniques on a broader range of models, including different DNN architectures and application domains, could provide further insights.

Finally, the researchers do not discuss the computational overhead of their MIP-based lower bound calculations. In practice, the time required to compute these bounds may be a limiting factor, especially for larger models or more pipeline stages. Developing more efficient bounding techniques would enhance the practicality of this approach.

Overall, this work makes valuable contributions to the challenge of efficient DNN inference, but there remain opportunities to extend the techniques to handle additional real-world complexities and constraints.

## Conclusion

The researchers have developed practical algorithms to optimize pipeline parallelism for DNN inference, with a focus on helping practitioners determine when a solution is good enough. By designing novel MIP relaxations to prove tight lower bounds, they are able to substantially reduce the optimality gap compared to standard approaches.

This work advances the state of the art in DNN performance optimization, providing tools and insights that can benefit a wide range of practitioners deploying deep learning models in the real world. The researchers' emphasis on bridging the gap between theory and practice is particularly commendable and should serve as a model for future research in this domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

## Related Papers

🤿

### Deep learning enhanced mixed integer optimization: Learning to reduce model dimensionality

Niki Triantafyllou, Maria M. Papathanasiou

0

0

This work introduces a framework to address the computational complexity inherent in Mixed-Integer Programming (MIP) models by harnessing the potential of deep learning. By employing deep learning, we construct problem-specific heuristics that identify and exploit common structures across MIP instances. We train deep learning models to estimate complicating binary variables for target MIP problem instances. The resulting reduced MIP models are solved using standard off-the-shelf solvers. We present an algorithm for generating synthetic data enhancing the robustness and generalizability of our models across diverse MIP instances. We compare the effectiveness of (a) feed-forward neural networks (ANN) and (b) convolutional neural networks (CNN). To enhance the framework's performance, we employ Bayesian optimization for hyperparameter tuning, aiming to maximize the occurrence of global optimum solutions. We apply this framework to a flow-based facility location allocation MIP formulation that describes long-term investment planning and medium-term tactical scheduling in a personalized medicine supply chain.

5/13/2024

🤯

### New!IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi

0

0

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows namex{} to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replications are available at https://github.com/reconfigurable-ml-pipeline/ipa.

5/28/2024

🔄

### Efficient Multi-Processor Scheduling in Increasingly Realistic Models

P'al Andr'as Papp, Georg Anegg, Aikaterini Karanasiou, A. N. Yzelman

0

0

We study the problem of efficiently scheduling a computational DAG on multiple processors. The majority of previous works have developed and compared algorithms for this problem in relatively simple models; in contrast to this, we analyze this problem in a more realistic model that captures many real-world aspects, such as communication costs, synchronization costs, and the hierarchical structure of modern processing architectures. For this we extend the well-established BSP model of parallel computing with non-uniform memory access (NUMA) effects. We then develop a range of new scheduling algorithms to minimize the scheduling cost in this more complex setting: several initialization heuristics, a hill-climbing local search method, and several approaches that formulate (and solve) the scheduling problem as an Integer Linear Program (ILP). We combine these algorithms into a single framework, and conduct experiments on a diverse set of real-world computational DAGs to show that the resulting scheduler significantly outperforms both academic and practical baselines. In particular, even without NUMA effects, our scheduler finds solutions of 24%-44% smaller cost on average than the baselines, and in case of NUMA effects, it achieves up to a factor $2.5times$ improvement compared to the baselines. Finally, we also develop a multilevel scheduling algorithm, which provides up to almost a factor $5times$ improvement in the special case when the problem is dominated by very high communication costs.

4/24/2024

### New!Pipeline Parallelism with Controllable Memory

Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin

0

0

Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block and we show that the lifespan of the building block decides the peak activation memory of the pipeline schedule. Guided by the observations, we find that almost all existing pipeline schedules, to the best of our knowledge, are memory inefficient. To address this, we introduce a family of memory efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B without sacrificing efficiency, and even to 1/3 with comparable throughput. We can also achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B. Our evaluations demonstrate that in pure pipeline parallelism settings, our methods outperform 1F1B by from 7% to 55% in terms of throughput. When employing a grid search over hybrid parallelism hyperparameters in practical scenarios, our proposed methods demonstrate a 16% throughput improvement over the 1F1B baseline for large language models.

5/27/2024