MOONCAKE: Trading More Storage for Less Computation —— A KVCache-centric Architecture for Serving LLM Chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu

February 2025

Abstract

MOONCAKE is the serving platform for Kimi, an LLM chatbot service developed by Moonshot AI. This platform features a KVCache-centric disaggregated architecture that not only separates prefill and decoding clusters but also efficiently utilizes the underexploited CPU, DRAM, SSD and NIC resources of the GPU cluster to establish a disaggregated KVCache. At the core of MOONCAKE is its KVCache-centric global cache and a scheduler designed to maximize throughput while adhering to stringent latency-related Service Level Objectives (SLOs).

Our experiments demonstrate that MOONCAKE excels in scenarios involving long-context inputs. In tests using real traces, MOONCAKE increases the effective request capacity by 59% to 498% when compared to baseline methods, all while complying with SLOs. Currently, MOONCAKE is operational across thousands of nodes, processing over 100 billion tokens daily. In practical deployments, MOONCAKE’s innovative architecture enables Kimi to handle 115% and 107% more requests on NVIDIA A800 and H800 clusters, respectively, compared to previous systems.

Type

Conference paper

Publication

23rd USENIX Conference on File and Storage Technologies