Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression, by Michael R. Metel and 2 other authors
Abstract: Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed, for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput while still maintaining the original model's accuracy.
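To make the idea concrete, below is a minimal sketch (not the authors' code) of evicting KV pairs while the prompt is still being processed, so each sequence's cache stays within a fixed budget and more sequences fit in one batch. The function name `compress_kv`, the `cache_budget` parameter, and the cumulative-attention eviction heuristic are illustrative assumptions, not necessarily the policy used in the paper.

```python
import torch

def compress_kv(keys, values, attn_scores, cache_budget):
    """Keep only the `cache_budget` KV pairs per head with the highest
    cumulative attention scores (a generic eviction heuristic used here
    for illustration only).

    keys, values: (batch, heads, seq_len, head_dim)
    attn_scores:  (batch, heads, seq_len) cumulative attention received
    """
    seq_len = keys.shape[2]
    if seq_len <= cache_budget:
        return keys, values, attn_scores
    # Indices of the top-scoring positions, restored to original order.
    top = attn_scores.topk(cache_budget, dim=-1).indices.sort(dim=-1).values
    idx = top.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, idx), values.gather(2, idx), attn_scores.gather(2, top)

# Toy usage: process a long prompt in chunks and compress after each chunk,
# so the cache never exceeds `cache_budget` entries per sequence during prefill.
batch, heads, head_dim, cache_budget, chunk = 4, 8, 64, 256, 128
keys = values = torch.empty(batch, heads, 0, head_dim)
scores = torch.empty(batch, heads, 0)
for _ in range(8):  # 8 prompt chunks of 128 tokens each
    new_k = torch.randn(batch, heads, chunk, head_dim)
    new_v = torch.randn(batch, heads, chunk, head_dim)
    new_s = torch.rand(batch, heads, chunk)  # stand-in for attention mass
    keys = torch.cat([keys, new_k], dim=2)
    values = torch.cat([values, new_v], dim=2)
    scores = torch.cat([scores, new_s], dim=2)
    keys, values, scores = compress_kv(keys, values, scores, cache_budget)
print(keys.shape)  # cache stays at (4, 8, 256, 64) rather than growing with the prompt
```

Because the per-sequence cache is capped throughout prefill rather than only after it, the freed GPU memory can be spent on a larger batch size, which is the source of the throughput gain described in the abstract.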
Submission history
From: Michael R. Metel
[v1] Sat, 7 Dec 2024 16:41:54 UTC (27 KB)
[v2] Tue, 17 Jun 2025 02:24:51 UTC (29 KB)
[v3] Thu, 3 Jul 2025 16:06:35 UTC (29 KB)