Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Griggs, Tyler; Liu, Xiaoxuan; Yu, Jiaxiang; Kim, Doyoung; Chiang, Wei-Lin; Cheung, Alvin; Stoica, Ion

(Submitted on 22 Apr 2024)

Abstract: Large language models (LLMs) are increasingly integrated into many online services. However, a major challenge in deploying LLMs is their high cost, due primarily to the use of expensive GPU instances. To address this problem, we find that the significant heterogeneity of GPU types presents an opportunity to increase GPU cost efficiency and reduce deployment costs. The broad and growing market of GPUs creates a diverse option space with varying costs and hardware specifications. Within this space, we show that there is not a linear relationship between GPU cost and performance, and identify three key LLM service characteristics that significantly affect which GPU type is the most cost effective: model request size, request rate, and latency service-level objective (SLO). We then present Mélange, a framework for navigating the diversity of GPUs and LLM service specifications to derive the most cost-efficient set of GPUs for a given LLM service. We frame the task of GPU selection as a cost-aware bin-packing problem, where GPUs are bins with a capacity and cost, and items are request slices defined by a request size and rate. Upon solution, Mélange derives the minimal-cost GPU allocation that adheres to a configurable latency SLO. Our evaluations across both real-world and synthetic datasets demonstrate that Mélange can reduce deployment costs by up to 77% as compared to utilizing only a single GPU type, highlighting the importance of making heterogeneity-aware GPU provisioning decisions for LLM serving. Our source code is publicly available at this https URL.
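
The cost-aware bin-packing framing in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the GPU names, hourly prices, per-GPU rate limits, and the function `cheapest_allocation` are all hypothetical, and an exhaustive search stands in for whatever solver the framework actually uses. Each request slice is a (size bucket, request rate) pair, each GPU type has an hourly cost and an assumed maximum rate it can sustain per size bucket within the latency SLO, and the search picks the slice-to-GPU-type assignment with the lowest total cost.

```python
import math
from itertools import product

# Hypothetical numbers for illustration only; not measurements from the paper.
GPU_COST_PER_HOUR = {"small": 1.0, "large": 4.0}

# Assumed maximum request rate (req/s) that one GPU of each type can sustain
# within the latency SLO, per request-size bucket.
MAX_RATE = {
    "small": {"short": 10.0, "long": 1.0},
    "large": {"short": 20.0, "long": 8.0},
}


def cheapest_allocation(slices, cost=GPU_COST_PER_HOUR, max_rate=MAX_RATE):
    """Exhaustively assign each slice to one GPU type and return the
    (total hourly cost, GPU count per type) of the cheapest assignment.

    slices: list of (size_bucket, request_rate) pairs.
    """
    gpu_types = sorted(cost)
    best = None
    for assignment in product(gpu_types, repeat=len(slices)):
        # Load is the fraction of one GPU's capacity that each slice consumes.
        load = {g: 0.0 for g in gpu_types}
        for (bucket, rate), g in zip(slices, assignment):
            load[g] += rate / max_rate[g][bucket]
        # Round the load on each type up to whole GPUs (small float tolerance).
        counts = {g: math.ceil(load[g] - 1e-9) for g in gpu_types}
        total = sum(counts[g] * cost[g] for g in gpu_types)
        if best is None or total < best[0]:
            best = (total, counts)
    return best


slices = [("short", 5.0), ("short", 5.0), ("long", 4.0), ("long", 4.0)]
print(cheapest_allocation(slices))  # (5.0, {'large': 1, 'small': 1})
```

With these made-up numbers, serving everything on the small type would need 9 GPUs (cost 9.0) and serving everything on the large type would need 2 GPUs (cost 8.0), while the mixed allocation costs 5.0. The exhaustive search grows exponentially with the number of slices, so it is only suitable for toy instances like this one.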

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2404.14527 [cs.DC]
	(or arXiv:2404.14527v1 [cs.DC] for this version)

Submission history

From: Tyler Griggs
[v1] Mon, 22 Apr 2024 18:56:18 GMT (4990kb,D)
