包阅 Digest Summary
1. Keywords: LLM, GKE, NVIDIA GPU, quantization, model serving
2. Summary: This article looks at how to choose GPUs for serving LLMs on GKE, covering model quantization, machine type selection, and GPU selection. It also introduces a performance benchmarking tool, offers optimization recommendations, and uses tables to show the performance characteristics of different models and GPUs.
3. Main content:
– Introduction
  – Serving AI foundation models is expensive; GKE features can help reduce the cost.
  – A performance benchmarking tool was created for GKE.
– Infrastructure decisions
  – Quantizing the model
    – Quantization reduces memory requirements; different quantization precisions affect accuracy and performance differently.
    – Use quantization to save memory and cost; evaluate model accuracy before using low-precision quantization.
  – Choosing a machine type
    – Calculate the machine type you need from the number of model parameters and the weight data type.
    – Gives reference memory requirements for different models at different precisions.
    – Choose an NVIDIA GPU based on the model's memory requirements; when a single GPU is not enough, use multi-GPU sharding with tensor parallelism.
  – Choosing a GPU
    – Describes the NVIDIA GPU types available in GKE and their performance characteristics.
    – Choosing the optimal accelerator requires considering GPU memory, bandwidth, and FLOPS.
    – Compares the throughput/$ of G2 and A3 for different Llama models.
Article URL: https://cloud.google.com/blog/products/ai-machine-learning/selecting-gpus-for-llm-serving-on-gke/
Source: cloud.google.com
Authors: Ashok Chandrasekar, Anna Pendleton
Published: 2024/8/23 0:00
Language: English
Total word count: 1,992
Estimated reading time: 8 minutes
Score: 84
Tags: large language models, GPU, GKE, quantization, inference optimization
The original article follows.
Let’s face it: Serving AI foundation models such as large language models (LLMs) can be expensive. Between the need for hardware accelerators to achieve lower latency and the fact that these accelerators are typically not efficiently utilized, organizations need an AI platform that can serve LLMs at scale while minimizing the cost per token. Through features like workload and infrastructure autoscaling and load balancing, Google Kubernetes Engine (GKE) can help you do just that.
When integrating an LLM into an application, you need to consider how to serve it cost-effectively while still providing the highest throughput within a certain latency bound. To help, we created a performance benchmarking tool for GKE that automates the end-to-end setup — from cluster creation to inference server deployment and benchmarking — which you can use to measure and analyze these performance tradeoffs.
Below are some recommendations that can help you maximize your serving throughput on NVIDIA GPUs on GKE. Combining these recommendations with the performance benchmarking tool will enable you to make data-driven decisions when setting up your inference stack on GKE. We also touch on how to optimize a model server platform for a given inference workload.
Infrastructure decisions
When selecting infrastructure that fits your model and is cost effective, you need to answer the following questions:
- Should you quantize your model? And if so, which quantization should you use?
- How do you pick a machine type to fit your model?
- Which GPU should you use?
Let’s take a deeper look at these questions.
1. Should you quantize your model? Which quantization should you use?
Quantization is a technique that decreases the amount of accelerator memory required to load the model weights; it does so by representing weights and activations with a lower-precision data type. Quantization results in cost savings and can improve latency and throughput thanks to the smaller memory footprint. But at a certain point, quantizing a model results in a noticeable loss in model accuracy.
Among quantization types, FP16 and Bfloat16 quantization provide virtually the same accuracy as FP32 (depending on the model, as shown in this paper), with half the memory usage. Most newer model checkpoints are already published in 16-bit precision. FP8 and INT8 can provide up to a 50% reduction in memory usage for model weights (KV cache will still consume similar memory if not quantized separately by the model server), often with minimal loss of accuracy.
Accuracy suffers with quantization of 4 bits or less, such as INT4 or INT3. Be sure to evaluate model accuracy before using 4-bit quantization. There are also post-training quantization techniques such as Activation-aware Weight Quantization (AWQ) that can help reduce the loss of accuracy.
You can deploy the model of your choice with different quantization techniques using our automation tool and evaluate them to see which best fits your needs.
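As an illustration of this tradeoff, the minimal sketch below loads either a 16-bit checkpoint or a 4-bit AWQ checkpoint. It assumes vLLM as the model server (any server with quantization support works similarly), and the model IDs are placeholders, not real checkpoints:

```python
from vllm import LLM, SamplingParams

# Choose the precision of the checkpoint you deploy. Model IDs are placeholders.
USE_AWQ = True

llm = (
    # 4-bit AWQ weights: roughly a quarter of the 16-bit weight memory,
    # so the model can fit on a smaller GPU. Evaluate accuracy first.
    LLM(model="org/model-7b-awq", quantization="awq")
    if USE_AWQ
    # 16-bit baseline: ~2 bytes per parameter.
    else LLM(model="org/model-7b", dtype="bfloat16")
)

outputs = llm.generate(
    ["What is Kubernetes?"],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```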
Recommendation 1: Use quantization to save memory and cost. If you use less than 8-bit precision, do so only after evaluating model accuracy.
2. How do you pick a machine type to fit your model?
A simple way to calculate the machine type that you need is to consider the number of parameters in the model and the data type of the model weights.
model size (in bytes) = # of model parameters * size of the data type in bytes
Thus, for a 7b model using a 16-bit precision quantization technique such as FP16 or BF16, you would need:
7 billion * 2 bytes = 14 billion bytes = 14 GB
Likewise, for a 7b model in 8-bit precision such as FP8 or INT8, you'd need:
7 billion * 1 byte = 7 billion bytes = 7 GB
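The same arithmetic can be expressed as a small helper; this is a sketch, and the function name and example values are purely illustrative:

```python
def model_weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate accelerator memory needed for model weights, in GB.

    bytes_per_param: 2 for FP16/BF16, 1 for FP8/INT8, 0.5 for 4-bit formats.
    """
    return params_billions * bytes_per_param


print(model_weight_memory_gb(7, 2))    # 7b model at 16-bit  -> 14.0 GB
print(model_weight_memory_gb(70, 1))   # 70b model at 8-bit  -> 70.0 GB
print(model_weight_memory_gb(7, 0.5))  # 7b model at 4-bit   -> 3.5 GB
```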
In the table below, we’ve applied these guidelines to show how much accelerator memory you might need for some popular open-weight LLMs.
| Models | Model variants (# of parameters) | FP16 (GB) | 8-bit precision (GB) | 4-bit precision (GB) |
| --- | --- | --- | --- | --- |
| Gemma | 2b | 4 | 2 | 1 |
| Gemma | 7b | 14 | 7 | 3.5 |
| Llama 3 | 8b | 16 | 8 | 4 |
| Llama 3 | 70b | 140 | 70 | 35 |
| Falcon | 7b | 14 | 7 | 3.5 |
| Falcon | 40b | 80 | 40 | 20 |
| Falcon | 180b | 360 | 180 | 90 |
| Flan T5 | 11b | 22 | 11 | 5.5 |
| Bloom | 176b | 352 | 176 | 88 |
Note: The above table is provided only as a reference. The exact number of parameters may differ from the number mentioned in the model name, e.g., Llama 3 8b or Gemma 7b. For open models, you can find the exact number of parameters on the Hugging Face model card page.
A best practice is to choose an accelerator on which the model weights consume no more than 80% of its memory, leaving 20% for the KV cache (the key-value cache the model server uses for efficient token generation). For example, on a G2 machine type with a single NVIDIA L4 Tensor Core GPU (24 GB), you can use 19.2 GB for model weights (24 * 0.8). Depending on the token length and the number of requests served, you might need up to 35% for the KV cache. For very long context lengths, such as 1M tokens, you will need to allocate even more memory to the KV cache and can expect it to dominate memory usage.
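A rough sketch of this rule of thumb follows; the 80% weight fraction is the assumption described above and should be tuned for your own workload:

```python
def fits_on_gpu(model_weights_gb: float, gpu_memory_gb: float,
                weight_fraction: float = 0.8) -> bool:
    """Return True if the weights fit within the fraction of GPU memory
    reserved for them, leaving the remainder (e.g., 20-35%) for the KV cache."""
    return model_weights_gb <= gpu_memory_gb * weight_fraction


# A 7b model at 16-bit (~14 GB of weights) on a single 24 GB L4:
print(fits_on_gpu(14, 24))  # True: 14 GB <= 19.2 GB
# A 70b model at 8-bit (~70 GB) on one 80 GB H100 with 80% reserved for weights:
print(fits_on_gpu(70, 80))  # False: 70 GB > 64 GB -> shard with tensor parallelism
```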
Recommendation 2: Choose your NVIDIA GPU based on the memory requirements of the model. When a single GPU is not enough, use multi-GPU sharding with tensor parallelism.
3. Which GPU should you use?
GKE offers a wide variety of VMs powered by NVIDIA GPUs. How do you decide which GPU to use to serve your model?
The table below shows a list of popular NVIDIA GPUs for running inference workloads along with their performance characteristics.
| Compute Engine Instance | NVIDIA GPU | Memory | HBM bandwidth | Mixed-precision FP16/FP32/bfloat16 Tensor Core peak compute |
| --- | --- | --- | --- | --- |
| A3 | NVIDIA H100 Tensor Core GPU | 80 GB | HBM3 @ 3.35 TBps | 1,979 TFLOPS |
| A2 ultra | NVIDIA A100 80GB Tensor Core GPU | 80 GB | HBM2e @ 1.9 TBps | 312 TFLOPS |
| A2 | NVIDIA A100 40GB Tensor Core GPU | 40 GB | HBM2 @ 1.6 TBps | 312 TFLOPS |
| G2 | NVIDIA L4 Tensor Core GPU | 24 GB | GDDR6 @ 300 GBps | 242 TFLOPS |
Note: A3 and G2 VMs support structural sparsity, which you can use for additional performance. The peak compute values shown assume sparsity; without sparsity, the figures are half of those listed.
Based on a model's characteristics, throughput and latency can be bound by three different dimensions:
- Throughput may be bound by GPU memory (GB per GPU): GPU memory holds the model weights and the KV cache. Batching increases throughput, but KV cache growth eventually hits the memory limit.
- Latency may be bound by GPU HBM bandwidth (GB/s): the model weights and KV cache state are read for every single token that is generated, so generation speed depends on memory bandwidth.
- For larger models, latency may be bound by GPU FLOPS: tensor computations rely on the GPU's FLOPS, and more hidden layers and attention heads mean more FLOPS are consumed.
Consider these dimensions when choosing the optimal accelerator to fit your latency and throughput needs.
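For a back-of-the-envelope feel for the bandwidth dimension, the sketch below estimates a lower bound on single-request decode latency, under the simplifying assumption that every generated token must stream all model weights from GPU memory (KV cache reads and compute are ignored):

```python
def min_decode_latency_ms_per_token(model_weight_bytes: float,
                                    hbm_bandwidth_bytes_per_s: float) -> float:
    """Rough lower bound on per-token decode latency (batch size 1), assuming
    decode is fully memory-bandwidth bound and dominated by weight reads."""
    return model_weight_bytes / hbm_bandwidth_bytes_per_s * 1e3


# A 7b model at 16-bit precision (~14 GB of weights):
print(min_decode_latency_ms_per_token(14e9, 300e9))    # L4,   ~300 GBps -> ~46.7 ms
print(min_decode_latency_ms_per_token(14e9, 3.35e12))  # H100, 3.35 TBps -> ~4.2 ms
```

This is only a first-order estimate, but it shows why higher-bandwidth GPUs cut per-token latency and why smaller (quantized) weights help on bandwidth-limited parts.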
Below, we compare the throughput/$ of G2 and A3 for Llama 2 7b and Llama 2 70b models. The chart uses normalized throughput/$ where G2’s performance is set to 1 and A3’s performance is compared against it.