Fugu-MT Paper Translation (Abstract): Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

Paper overview: Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

arxiv url: http://arxiv.org/abs/2511.02230v1
Date: Tue, 04 Nov 2025 03:43:05 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:05.798261
Title: Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
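Title (reference translation): Continuity: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache
Authors: Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Alvin Cheung, Joseph Gonzalez, Ion Stoica
Abstract summary: Continuum is a serving system that optimizes job completion time for multi-turn agent workloads. By predicting the duration of an agent's tool calls, Continuum selectively pins the KV cache in GPU memory based on the total number of turns. An evaluation on real-world agentic workloads with Llama-3.1 8B/70B models shows that Continuum significantly improves average job completion time.
Reference score (attention score computed independently by this site): 30.099614426825834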
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agentic LLM applications interleave LLM generation requests with tool calls. These tool calls break the continuity of the workflow by creating pauses between LLM requests, which poses many challenges for the serving system, especially in multi-turn scenarios. Each pause can cause KV cache eviction and extra waiting time before the following LLM request enters the continuous batch. Because these pauses occur at every tool call, the problem becomes increasingly severe as the number of turns in an agentic program grows. Prior work either fails to incorporate information from the tool call, evicting the KV cache and forcing repeated prefill or loading, or ignores the continuity of a multi-turn program, creating waiting time between turns that increases per-request latency. We present Continuum, a serving system that optimizes job completion time for multi-turn agent workloads by combining a tool-aware KV cache timeout with program-level scheduling. By predicting tool call durations in agentic workflows, Continuum selectively pins the KV cache in GPU memory with a time-to-live value based on the total number of turns. When combined with program-level first-come-first-serve scheduling, Continuum prevents scheduling bubbles, preserves multi-turn continuity, and optimizes throughput for complex agentic workflows. By modeling the variability of tool calls and the continuity of agent programs, Continuum outperforms state-of-the-art baselines. Our evaluation on real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models shows that Continuum significantly improves average job completion time and remains performant across different hardware setups and DRAM offloading schemes. Preview code is available at: https://github.com/Hanchenli/vllm-continuum
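Abstract (reference translation): Agentic LLM applications interleave LLM generation requests with tool calls. These tool calls create pauses between LLM requests and break the continuity of the workflow, raising many challenges for the serving system, particularly in multi-turn scenarios. Each pause may lead to KV cache eviction and additional waiting time before the next LLM request joins the continuous batch. Since a pause occurs on every call, the problem worsens as the number of turns in an agent program increases. Earlier work either does not use information from the tool call, so it evicts KV cache and incurs repeated prefill or loading, or it ignores the continuity of a multi-turn program, so it introduces waiting time between turns that raises per-request latency. The paper proposes Continuum, a serving system that optimizes job completion time for multi-turn agent workloads by combining a tool-aware KV cache timeout with program-level scheduling. By predicting tool call durations in agentic workflows, Continuum selectively pins the KV cache in GPU memory based on the total number of turns. Combined with program-level first-come-first-serve scheduling, Continuum prevents scheduling bubbles, preserves multi-turn continuity, and optimizes throughput for complex agentic workflows. By modeling the variability of tool calls and the continuity of agent programs, Continuum outperforms state-of-the-art baselines. In an evaluation on real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models, Continuum significantly improves average job completion time and maintains its performance across different hardware setups and DRAM offloading schemes. Preview code is available at https://github.com/Hanchenli/vllm-continuum.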
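The two mechanisms the abstract combines, a time-to-live on pinned KV caches and program-level first-come-first-serve ordering, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the vllm-continuum implementation: the class names `TTLPinTable` and `ProgramFCFSQueue`, their methods, and the hand-picked TTL values are all hypothetical. The abstract does not give the formula Continuum uses to derive a TTL from the predicted tool call duration and the total turn number, so the sketch takes the TTL as an input.

```python
import heapq


class TTLPinTable:
    """Toy record of KV caches pinned in GPU memory, each with an expiry time.

    Hypothetical structure for illustration; the TTL is supplied by the caller.
    """

    def __init__(self) -> None:
        self._expires_at: dict[str, float] = {}

    def pin(self, program_id: str, now: float, ttl: float) -> None:
        # Pin (or re-pin) the program's KV cache until now + ttl.
        self._expires_at[program_id] = now + ttl

    def is_pinned(self, program_id: str, now: float) -> bool:
        # True if the next turn arriving at `now` can still reuse the cache.
        expiry = self._expires_at.get(program_id)
        return expiry is not None and now < expiry

    def evict_expired(self, now: float) -> list[str]:
        # Drop every entry whose TTL has elapsed and report which ones.
        expired = sorted(p for p, t in self._expires_at.items() if t <= now)
        for program_id in expired:
            del self._expires_at[program_id]
        return expired


class ProgramFCFSQueue:
    """Orders requests by when their program first arrived, not the request."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._program_arrival: dict[str, float] = {}
        self._seq = 0  # tie-breaker that keeps submission order stable

    def submit(self, program_id: str, now: float) -> None:
        # Later turns inherit the arrival time of the program's first turn.
        arrival = self._program_arrival.setdefault(program_id, now)
        heapq.heappush(self._heap, (arrival, self._seq, program_id))
        self._seq += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = ProgramFCFSQueue()
    table = TTLPinTable()

    queue.submit("A", now=0.0)        # program A, turn 1
    table.pin("A", now=2.0, ttl=5.0)  # A pauses for a tool call at t=2
    queue.submit("C", now=4.5)        # a new program arrives meanwhile
    queue.submit("A", now=5.0)        # A's tool call returns, turn 2

    queue.pop()                       # A's turn 1
    print(queue.pop())                # A's turn 2 runs ahead of C
    print(table.is_pinned("A", now=5.0))
```

In this toy model, a later turn of an already admitted program is scheduled ahead of a newer program, and it finds its KV cache still pinned as long as the tool call returns before the TTL expires. An entry whose TTL has elapsed is released by `evict_expired`, which bounds how long an idle program can hold GPU memory.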

