Fugu-MT Paper Translation (Abstract): SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging

Paper overview: SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging

arxiv url: http://arxiv.org/abs/2603.14303v1
Date: Sun, 15 Mar 2026 09:36:25 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.733624
Title: SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging
Title (reference translation): SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustering
Authors: Shunlong Wu, Hai Lin, Shaoshen Chen, Tingwei Lu, Yongqin Zeng, Shaoxiong Zhan, Hai-Tao Zheng, Hong-Gee Kim
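Abstract summary: SemantiCache is a novel compression framework that preserves semantic integrity. It first partitions the cache into semantically coherent chunks. Within each chunk, a Greedy Seed-Based Clustering (GSC) algorithm groups tokens into semantic clusters. These clusters are then merged into semantic cores, enhanced by a Proportional Attention mechanism.
Reference score (independently computed attention metric): 14.82266992933174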
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing KV cache compression methods generally operate on discrete tokens or non-semantic chunks. However, such approaches often lead to semantic fragmentation, where linguistically coherent units are disrupted, causing irreversible information loss and degradation in model performance. To address this, we introduce SemantiCache, a novel compression framework that preserves semantic integrity by aligning the compression process with the semantic hierarchical nature of language. Specifically, we first partition the cache into semantically coherent chunks by delimiters, which are natural semantic boundaries. Within each chunk, we introduce a computationally efficient Greedy Seed-Based Clustering (GSC) algorithm to group tokens into semantic clusters. These clusters are further merged into semantic cores, enhanced by a Proportional Attention mechanism that rebalances the reduced attention contributions of the merged tokens. Extensive experiments across diverse benchmarks and models demonstrate that SemantiCache accelerates the decoding stage of inference by up to 2.61 times and substantially reduces memory footprint, while maintaining performance comparable to the original model.
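Abstract (reference translation): Existing KV cache compression methods generally operate on discrete tokens or non-semantic chunks. Such approaches, however, often cause semantic fragmentation: linguistically coherent units are broken apart, leading to irreversible information loss and degraded model performance. To address this, the paper introduces SemantiCache, a novel compression framework that preserves semantic integrity by aligning the compression process with the hierarchical semantic nature of language. Specifically, the cache is first partitioned into semantically coherent chunks at delimiters, which act as natural semantic boundaries. Within each chunk, a computationally efficient Greedy Seed-Based Clustering (GSC) algorithm groups tokens into semantic clusters. These clusters are further merged into semantic cores, enhanced by a Proportional Attention mechanism that rebalances the reduced attention contributions of the merged tokens. Extensive experiments across diverse benchmarks and models show that SemantiCache accelerates the decoding stage of inference by up to 2.61 times and substantially reduces memory footprint while maintaining performance comparable to the original model.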
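The three stages named in the abstract (delimiter chunking, greedy seed-based clustering, merging with proportional attention) can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract does not specify the delimiter set, the similarity measure, the threshold, the merge rule, or the exact form of Proportional Attention. The sketch assumes cosine similarity between key vectors, mean merging of keys and values, and a log-cluster-size bias added to the attention logits. All function names and parameters here (`chunk_by_delimiters`, `greedy_seed_clusters`, `compress`, `proportional_attention`, `threshold`) are hypothetical.

```python
import numpy as np

DELIMITERS = {".", ",", ";", "?", "!", "\n"}  # assumed delimiter set

def chunk_by_delimiters(tokens, delimiters=DELIMITERS):
    """Split token positions into chunks, closing a chunk after each delimiter."""
    chunks, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if tok in delimiters:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

def greedy_seed_clusters(keys, idx, threshold=0.8):
    """Greedy pass over one chunk: the first unassigned token becomes a seed
    and absorbs every remaining token whose key is cosine-similar to it."""
    k = keys[idx]
    unit = k / np.maximum(np.linalg.norm(k, axis=1, keepdims=True), 1e-12)
    remaining = np.arange(len(idx))
    clusters = []
    while remaining.size:
        seed = remaining[0]
        close = unit[remaining] @ unit[seed] >= threshold
        close[0] = True  # the seed always belongs to its own cluster
        clusters.append([idx[i] for i in remaining[close]])
        remaining = remaining[~close]
    return clusters

def compress(keys, values, tokens, threshold=0.8):
    """Merge each cluster into one averaged key/value pair (assumed merge rule)
    and record how many tokens each merged entry stands for."""
    new_k, new_v, sizes = [], [], []
    for chunk in chunk_by_delimiters(tokens):
        for cluster in greedy_seed_clusters(keys, chunk, threshold):
            new_k.append(keys[cluster].mean(axis=0))
            new_v.append(values[cluster].mean(axis=0))
            sizes.append(len(cluster))
    return np.stack(new_k), np.stack(new_v), np.array(sizes, dtype=float)

def proportional_attention(q, k, v, sizes):
    """Softmax attention with a log(size) bias, so an entry that replaces
    several tokens keeps a proportionally larger share of attention."""
    logits = q @ k.T / np.sqrt(k.shape[1]) + np.log(sizes)
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Toy cache: repeated tokens share identical keys and values.
tokens = ["a", "a", ".", "b", "b", "b", "."]
keys = np.eye(4)[[0, 0, 1, 2, 2, 2, 3]]
values = keys * 2.0 + 1.0
query = np.array([0.3, -1.2, 0.8, 0.5])

ck, cv, sizes = compress(keys, values, tokens)
full = proportional_attention(query, keys, values, np.ones(len(tokens)))
merged = proportional_attention(query, ck, cv, sizes)
```

In this toy case the merged tokens have identical keys and values, so the log-size bias makes attention over the 4-entry compressed cache match attention over the original 7 entries exactly. With merely similar tokens the result is an approximation, and without the bias each merged entry would be weighted as a single token.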
