Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization

2405.07140

Published 5/14/2024 by Xinyuan Zhang, Jiang Liu, Zehui Xiong, Yudong Huang, Gaochang Xie, Ran Zhang

Abstract

Generative Artificial Intelligence (GAI) is taking the world by storm with its unparalleled content creation ability. Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. Although edge intelligence has long been utilized to solve these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, most research has focused on traditional AI models and has left a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, with the deployment of the batching technique and model quantization on resource-limited edge devices, we formulate an inference model for transformer decoder-based LLMs. Furthermore, our approach aims to maximize the inference throughput via batch scheduling and joint allocation of communication and computation resources, while also considering edge resource constraints and varying user requirements of latency and accuracy. To address this NP-hard problem, we develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP surpasses other batching benchmarks in throughput across diverse user settings and quantization techniques, and it reduces time complexity by over 45% compared to the brute-force searching method.

Overview

  • Large Language Models (LLMs) are at the forefront of Generative Artificial Intelligence (GAI), enabling unprecedented content creation abilities.
  • However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations.
  • Edge intelligence has been used to address these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources.
  • Most research has focused on traditional AI models, leaving a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms.

Plain English Explanation

The paper discusses a new approach to using edge intelligence to optimize the inference of Large Language Models (LLMs) on resource-constrained edge devices. LLMs are a leading form of Generative Artificial Intelligence (GAI) with strong content creation abilities, but they require so much computing power that they are usually hosted on cloud servers. Cloud hosting raises privacy concerns, adds latency, and imposes usage limitations.

The researchers study how to run LLM inference (the process of using the model to generate content) on resource-limited edge devices located close to the data sources. Running inference at the edge addresses the issues with cloud hosting by enabling real-time AI computation near where the data is produced.

However, the unique characteristics of LLMs, such as their large model size and complex mechanisms like auto-regression and self-attention, have made it challenging to adapt existing edge intelligence solutions. The paper presents a new optimization problem and algorithm specifically tailored for LLM inference on edge devices.
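To make the model-size and auto-regression points concrete, here is a back-of-the-envelope sketch in Python. It uses the standard estimates for weight storage (parameters times bits per weight) and for the key/value cache that auto-regressive decoding keeps for every sequence in a batch, assuming standard multi-head attention. The function names and the example model shape (7 billion parameters, 32 layers, hidden size 4096) are illustrative assumptions, not figures from the paper.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the model weights at a given precision."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   batch_size: int, bytes_per_value: int = 2) -> int:
    """Bytes for the cached keys and values across all layers.

    Assumes standard multi-head attention: each token stores one key
    vector and one value vector of length hidden_size per layer.
    """
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value


# Hypothetical 7B-parameter decoder: 16-bit versus 4-bit weights.
fp16_weights = weight_bytes(7_000_000_000, 16)
int4_weights = weight_bytes(7_000_000_000, 4)

# Hypothetical shape: 32 layers, hidden size 4096, 1024 tokens, batch of 8.
cache = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=1024, batch_size=8)

print(fp16_weights / 1e9, "GB of weights at 16-bit")
print(int4_weights / 1e9, "GB of weights at 4-bit")
print(cache / 2**30, "GiB of KV cache for the batch")
```

Quantization shrinks the weight term, while the cache term grows linearly with both batch size and sequence length, which is why batching and quantization interact with the memory limits of an edge device.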

Technical Explanation

The key elements of the paper are:

  1. Inference Model for Transformer Decoder-based LLMs: The researchers have formulated an inference model for transformer decoder-based LLMs, which are a common type of LLM architecture. This model takes into account the unique characteristics of LLMs, such as their considerable model size and auto-regressive processes.

  2. Optimization Problem: The researchers have defined an optimization problem that aims to maximize the inference throughput of LLMs on edge devices. This involves batch scheduling and the joint allocation of communication and computation resources, while also considering edge resource constraints and varying user requirements for latency and accuracy.

  3. Optimal Depth-First Tree-Searching Algorithm with Online Tree-Pruning (DFTSP): To address the NP-hard optimization problem, the researchers have developed an algorithm called DFTSP. This algorithm uses a depth-first tree search with online pruning to find the optimal solution within a feasible time complexity.

  4. Evaluation: The researchers have simulated the performance of their DFTSP algorithm and compared it to other batching benchmarks. The results show that DFTSP outperforms the benchmarks in terms of throughput across diverse user settings and quantization techniques, and it reduces time complexity by over 45% compared to the brute-force searching method.
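The general idea of a depth-first tree search with pruning can be sketched as follows. This is not the paper's DFTSP algorithm or its system model: it is a minimal illustration under a simplified, assumed feasibility rule, where a batch is feasible if its total memory fits a cap and its total work finishes before the tightest deadline in the batch. The `Request` fields, `feasible`, and `dfs_max_batch` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    memory: float    # assumed memory the request adds to the batch
    work: float      # assumed compute time the request adds to the batch
    deadline: float  # latest acceptable completion time for the batch


def feasible(batch: List[Request], memory_cap: float) -> bool:
    """Simplified rule: fit in memory and finish before the tightest deadline."""
    if not batch:
        return True
    total_memory = sum(r.memory for r in batch)
    total_work = sum(r.work for r in batch)
    return total_memory <= memory_cap and total_work <= min(r.deadline for r in batch)


def dfs_max_batch(requests: List[Request], memory_cap: float) -> List[Request]:
    """Return a largest feasible batch via depth-first search with pruning."""
    best: List[Request] = []

    def visit(i: int, chosen: List[Request]) -> None:
        nonlocal best
        # Bound pruning: even taking every remaining request cannot
        # produce a batch larger than the best one found so far.
        if len(chosen) + (len(requests) - i) <= len(best):
            return
        if i == len(requests):
            best = list(chosen)  # strictly larger, thanks to the bound above
            return
        # Branch 1: include request i. Adding a request never makes an
        # infeasible batch feasible, so an infeasible branch is cut here.
        chosen.append(requests[i])
        if feasible(chosen, memory_cap):
            visit(i + 1, chosen)
        chosen.pop()
        # Branch 2: skip request i.
        visit(i + 1, chosen)

    visit(0, [])
    return best


requests = [
    Request(memory=2, work=1, deadline=10),
    Request(memory=3, work=2, deadline=4),
    Request(memory=4, work=1, deadline=3),
    Request(memory=1, work=1, deadline=5),
    Request(memory=5, work=3, deadline=20),
]
batch = dfs_max_batch(requests, memory_cap=8)
print(len(batch), "requests scheduled")
```

Both pruning rules preserve optimality in this toy setting: the feasibility cut is safe because the assumed constraints only tighten as requests are added, and the bound cut only discards subtrees that cannot beat the incumbent. The search still returns an optimal batch while visiting fewer nodes than exhaustive enumeration.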

Critical Analysis

The paper presents a novel and promising approach to optimizing LLM inference on edge devices, which could help address some of the key challenges associated with cloud-based hosting of these models. However, there are a few potential limitations and areas for further research:

  1. Model Diversity: The paper focuses on transformer decoder-based LLMs, but other transformer architectures (e.g., encoder-only models such as BERT, or encoder-decoder models such as T5) may have different characteristics and requirements. It would be valuable to investigate how the proposed approach could be adapted to handle a broader range of models.

  2. Real-World Deployment: The evaluation in the paper is based on simulations, and it would be important to validate the performance of the DFTSP algorithm in real-world edge deployments, where factors like network latency, device heterogeneity, and dynamic workloads may introduce additional challenges.

  3. Scalability and Generalizability: As the size and complexity of LLMs continue to grow, it will be crucial to ensure that the optimization approach can scale effectively and remain generalizable to future advancements in the field.

  4. Benchmarking and Standardization: The evaluation compares DFTSP against other batching benchmarks in simulation. It would be valuable to see the proposed approach evaluated with a more widely accepted LLM benchmarking framework to facilitate broader comparisons and adoption.

Conclusion

This paper presents an approach to optimizing LLM inference on resource-constrained edge devices, which could help address the privacy, latency, and usage limitations of cloud-based hosting. In simulation, the proposed DFTSP algorithm achieves higher throughput than other batching benchmarks and reduces time complexity by over 45% compared to brute-force search, but further research is needed to validate the approach in real-world deployments and to ensure it scales and generalizes as LLM technology continues to evolve.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.

6/6/2024

On the Compressibility of Quantized Large Language Models

Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, Chun Jason Xue

Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory requirement of LLMs. Quantization is an effective way of reducing the model size while maintaining good performance. However, even after quantization, LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference. In this case, the I/O latency of model loading becomes the bottleneck of the LLM inference latency. In this work, we take a preliminary step of studying applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices. In particular, we discussed the compressibility of quantized LLMs, the trade-off between the compressibility and performance of quantized LLMs, and opportunities to optimize both of them jointly.

5/7/2024

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang

Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we provide some knowledge summary and discuss future research directions.

6/11/2024

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performances in a wide range of text generation tasks. However, the enormous model sizes have hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiencies of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need of any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method exhibit even more when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the available code.

4/24/2024