Fugu-MT Paper Translation (Abstract): Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

Paper overview: Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

arxiv url: http://arxiv.org/abs/2605.15913v2
Date: Thu, 21 May 2026 06:50:33 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:41.858832
Title: Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
Title (reference translation): Toward Generalizing Block Attention through Automatic Segmentation and Block Distillation
Authors: Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam
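Abstract summary: Block attention can improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader use is hindered by the difficulty of segmenting input text into meaningful, self-contained blocks and by the inefficiency of existing block fine-tuning methods, which risk degrading performance. The authors propose block distillation, a training framework more efficient than block fine-tuning, in which a frozen full-attention teacher model guides the block-attention student.
Reference score (attention metric computed by this site): 61.19473093799777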
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.
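Abstract (reference translation): Block attention processes the input as separate blocks that cannot attend to one another, and it has significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its wider adoption is held back by two problems: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods, which risk degrading performance. To address these problems, the authors first build SemanticSeg, a large and diverse semantic segmentation dataset with more than 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths from 2k to 32k. Using this dataset, they train a lightweight segmenter that automatically partitions text into blocks aligned with human intuition, with controllable granularity. Second, they propose block distillation, a training framework more efficient than block fine-tuning, in which a frozen full-attention teacher model guides the block-attention student. The framework integrates three new components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks show that the segmenter outperforms heuristic and statistical baselines and that block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable path for deploying block attention.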
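
Illustrative sketch (not from the paper): the abstract describes block attention as processing the input in blocks that cannot attend to one another. One way to picture this is as a boolean attention mask. The function name `block_attention_mask`, the `last_block_global` flag, and the choice to let the final block (for example, a user query) see all earlier blocks are assumptions made for this example; the abstract does not specify them.

```python
from typing import List


def block_attention_mask(
    block_lengths: List[int], last_block_global: bool = True
) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j."""
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    total = len(block_of)
    last = len(block_lengths) - 1
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only the current and earlier positions
            same_block = block_of[i] == block_of[j]
            # Assumption: the final block may attend to every earlier block.
            global_query = last_block_global and block_of[i] == last
            mask[i][j] = same_block or global_query
    return mask


if __name__ == "__main__":
    # Two context blocks of two tokens each, followed by a one-token query block.
    for row in block_attention_mask([2, 2, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this mask, the rows of a context block never reference tokens from another block, which is the property that, in principle, lets a block's keys and values be cached and reused across prompts.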

Related papers