Fugu-MT Paper Translation (Summary): Sliding Window Attention for Learned Video Compression

Paper overview: Sliding Window Attention for Learned Video Compression

arxiv url: http://arxiv.org/abs/2510.03926v1
Date: Sat, 04 Oct 2025 20:11:43 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.346944
Title: Sliding Window Attention for Learned Video Compression
Authors: Alexander Kopte, André Kaup
Abstract summary: This work introduces 3D Sliding Window Attention (SWA). It achieves Bjøntegaard Delta-rate savings of up to 18.6%.
Reference score (attention score computed in-house by Fugu-MT): 67.57073402826292
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To manage the complexity of transformers in video compression, local attention mechanisms are a practical necessity. The common approach of partitioning frames into patches, however, creates architectural flaws like irregular receptive fields. When adapted for temporal autoregressive models, this paradigm, exemplified by the Video Compression Transformer (VCT), also necessitates computationally redundant overlapping windows. This work introduces 3D Sliding Window Attention (SWA), a patchless form of local attention. By enabling a decoder-only architecture that unifies spatial and temporal context processing, and by providing a uniform receptive field, our method significantly improves rate-distortion performance, achieving Bjøntegaard Delta-rate savings of up to 18.6% against the VCT baseline. Simultaneously, by eliminating the need for overlapping windows, our method reduces overall decoder complexity by a factor of 2.8, while its entropy model is nearly 3.5 times more efficient. We further analyze our model's behavior and show that while it benefits from long-range temporal context, excessive context can degrade performance.
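The abstract describes 3D SWA only at a conceptual level. As an illustration, the following minimal Python sketch assumes one concrete window rule: a query token at (frame t, row y, column x) attends to tokens within a square spatial radius `radius` in the `frames` most recent frames, and, in the current frame, only to positions at or before it in raster-scan order. The window rule, the single attention head without learned projections, and the names `swa_allowed` and `sliding_window_attention` are hypothetical and are not taken from the paper. Because the window is centered on each token rather than fixed by a patch grid, every token away from the frame borders sees the same window shape, which is the uniform-receptive-field property the abstract refers to.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Coord = Tuple[int, int, int]  # (t, y, x): frame index, row, column of a latent token


def swa_allowed(q: Coord, k: Coord, radius: int, frames: int) -> bool:
    """Hypothetical causal 3D sliding-window rule (an assumption, not the paper's mask).

    The key must lie in one of the `frames` most recent frames (current frame included),
    within `radius` rows and columns of the query, and, in the current frame, at or
    before the query in raster-scan order.
    """
    qt, qy, qx = q
    kt, ky, kx = k
    if not 0 <= qt - kt < frames:  # no future frames, no frames outside the temporal window
        return False
    if abs(qy - ky) > radius or abs(qx - kx) > radius:  # square spatial window around the query
        return False
    if kt == qt:
        return (ky, kx) <= (qy, qx)  # current frame: only already-decoded positions
    return True


def sliding_window_attention(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    coords: Sequence[Coord],
    radius: int,
    frames: int,
) -> List[List[float]]:
    """Single-head scaled dot-product attention restricted by swa_allowed.

    Assumes radius >= 0 and frames >= 1, so every query can at least attend to itself.
    """
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        # Indices of all keys inside this query's 3D window (naive scan for readability).
        idx = [j for j in range(len(coords)) if swa_allowed(coords[i], coords[j], radius, frames)]
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in idx]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * values[j][c] for w, j in zip(weights, idx)) / z
                    for c in range(len(values[0]))])
    return out


# Usage: a 2-frame, 3x3 latent grid with 2-dimensional toy features.
coords = list(product(range(2), range(3), range(3)))
feats = [[float(t + 1), float(y - x)] for t, y, x in coords]
out = sliding_window_attention(feats, feats, feats, coords, radius=1, frames=2)
window = sum(swa_allowed((1, 1, 1), k, radius=1, frames=2) for k in coords)
print(window)  # 14: 9 tokens from the previous frame + 5 decoded tokens in the current one
```

The naive scan over all tokens costs O(N^2) per layer and only serves to show which tokens fall into each window; a practical implementation would gather each window's neighborhood directly.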
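The reported gain is a Bjøntegaard Delta-rate (BD-rate): the average bitrate difference between two rate-distortion curves at equal quality, where negative values are savings, so "up to 18.6% savings" corresponds to a BD-rate of -18.6% in the most favorable setting. A minimal sketch of the commonly used procedure follows: fit log-rate as a cubic polynomial of PSNR for each codec, integrate both fits over the overlapping PSNR range, and convert the average log difference back to a percentage. This is not the authors' evaluation code; the paper may use a different distortion metric or interpolation variant, and the function names and rates in the usage example are hypothetical toy values, not results from the paper.

```python
import math
from typing import List, Sequence


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [A | b]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def _cubic_fit(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Least-squares cubic fit via normal equations; returns coefficients c0..c3.

    Callers should center xs so that the powers stay small and the system well conditioned.
    """
    a = [[sum(x ** (i + j) for x in xs) for j in range(4)] for i in range(4)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(4)]
    return _solve(a, rhs)


def bd_rate(rate_anchor: Sequence[float], psnr_anchor: Sequence[float],
            rate_test: Sequence[float], psnr_test: Sequence[float]) -> float:
    """BD-rate of the test codec against the anchor, in percent (negative = bitrate savings)."""
    if min(len(psnr_anchor), len(psnr_test)) < 4:
        raise ValueError("a cubic fit needs at least four RD points per curve")
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    if hi <= lo:
        raise ValueError("PSNR ranges do not overlap")
    center = (lo + hi) / 2

    def avg_log_rate(rates: Sequence[float], psnrs: Sequence[float]) -> float:
        c = _cubic_fit([p - center for p in psnrs], [math.log(r) for r in rates])
        a, b = lo - center, hi - center
        # Exact integral of the fitted polynomial over the overlapping PSNR interval.
        integral = sum(ci * (b ** (i + 1) - a ** (i + 1)) / (i + 1) for i, ci in enumerate(c))
        return integral / (hi - lo)

    diff = avg_log_rate(rate_test, psnr_test) - avg_log_rate(rate_anchor, psnr_anchor)
    return (math.exp(diff) - 1.0) * 100.0


# Usage with made-up RD points: the test codec needs 80% of the anchor's rate everywhere.
rates = [100.0, 200.0, 400.0, 800.0]
psnrs = [30.0, 33.0, 36.0, 39.0]
cheaper = [0.8 * r for r in rates]
print(round(bd_rate(rates, psnrs, cheaper, psnrs), 6))  # -20.0
```

Fitting log-rate rather than raw rate makes the measure a relative (percentage) saving, which is why a constant 80% rate ratio yields exactly -20% regardless of the absolute bitrates.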
