Fugu-MT Paper Translation (Summary): LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Paper summary: LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

arxiv url: http://arxiv.org/abs/2602.01053v1
Date: Sun, 01 Feb 2026 06:36:46 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.559766
Title: LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents
Title (reference translation): LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents
Authors: Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim
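Abstract summary: We propose LRAgent, a KV cache sharing framework for multi-LoRA agents. LRAgent decomposes the cache into a shared base component from the pretrained weights and an adapter-dependent component from the LoRA weights. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching.
Reference score (independently computed attention score): 9.162948089580143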
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only through lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, across agents, cache differences are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component from the pretrained weights and an adapter-dependent component from LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form, and further reduces compute overhead, enabled by shared-$A$ multi-LoRA architectures, by also sharing the low-rank cache and avoiding redundant computations for contexts already processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.
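Abstract (reference translation): Role specialization in multi-LLM agent systems is often realized through multi-LoRA, in which agents share a pretrained backbone and differ only in their lightweight adapters. Although the base model weights are shared, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, which incurs substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that cache differences across agents are dominated by adapter outputs, whereas activations from the shared pretrained backbone remain highly similar. We therefore propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component derived from the pretrained weights and an adapter-dependent component derived from the LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form. Enabled by shared-$A$ multi-LoRA architectures, it also reduces compute overhead by sharing the low-rank cache and avoiding redundant computation for contexts that other agents have already processed. To reconstruct adapter contributions efficiently at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders the attention computation to avoid materializing the low-rank cache at full dimension. LRAgent achieves throughput and time-to-first-token latency close to those of fully shared caching while keeping accuracy close to the non-shared caching baseline on agentic question-answering benchmarks.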
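
The cache decomposition and the attention reordering described in the abstract can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' implementation or the Flash-LoRA-Attention kernel. It assumes a single attention head and a single layer, identical input hidden states for every agent (the abstract only claims the base activations are highly similar), and it ignores softmax, scaling, positional encoding, and the value path. All names (`W_k`, `A`, `B_agents`, `planner`, `coder`, the helper functions) are hypothetical and chosen for illustration only.

```python
import random

# Tiny pure-Python matrix helpers (lists of rows).
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d, r = 6, 8, 2  # context length, hidden size, LoRA rank (toy values)

X = rand(T, d, rng)    # hidden states of the shared context (assumed identical across agents)
W_k = rand(d, d, rng)  # pretrained key projection, shared by all agents
A = rand(d, r, rng)    # LoRA down-projection, shared across agents (shared-A assumption)
B_agents = {           # per-agent LoRA up-projections (hypothetical agent names)
    "planner": rand(r, d, rng),
    "coder": rand(r, d, rng),
}

# Both caches are computed once and reused by every agent.
K_base = matmul(X, W_k)  # T x d, shared base component
K_low = matmul(X, A)     # T x r, shared low-rank component

def scores_materialized(q, B):
    """Reference path: expand the low-rank cache to full dimension first."""
    K_full = add(K_base, matmul(K_low, B))  # T x d, built per agent
    return matmul(q, transpose(K_full))     # 1 x T

def scores_reordered(q, B):
    """Reordered path: project the query down to rank r instead."""
    base = matmul(q, transpose(K_base))    # 1 x T
    q_low = matmul(q, transpose(B))        # 1 x r
    low = matmul(q_low, transpose(K_low))  # 1 x T
    return add(base, low)

q = rand(1, d, rng)  # one decoding-step query
max_err = max(
    abs(a - b)
    for B in B_agents.values()
    for a, b in zip(scores_materialized(q, B)[0], scores_reordered(q, B)[0])
)

# Key-cache entries stored for the context: one full cache per agent vs. shared caches.
per_agent_entries = len(B_agents) * T * d
shared_entries = T * d + T * r
```

Both paths compute the same attention scores because q (K_base + K_low B)^T = q K_base^T + (q B^T) K_low^T; the reordered form never builds the T x d per-agent key matrix. In this toy setting the shared layout stores T*d + T*r entries instead of one T*d cache per agent.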

