Fugu-MT paper translation (abstract): T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

Paper overview: T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

arxiv url: http://arxiv.org/abs/2604.18573v1
Date: Mon, 20 Apr 2026 17:57:02 GMT
Status: translation complete
Last updated in system: 2026-04-21 21:52:53.037802
Title: T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
Title (reference translation): T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
Authors: Savya Khosla, Sethuraman T, Aryan Chadha, Alex Schwing, Derek Hoiem
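Abstract summary: T-REN (Text-aligned Region Encoder Network) is an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations.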
Reference score (independently computed attention metric): 10.51971591757392
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, T-REN delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch-based vision-language backbone. The code and model are available at https://github.com/savya08/T-REN.
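Abstract (reference translation): Despite recent progress, vision-language encoders suffer from two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks such as open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limit scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, it delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing. The code and model are available at https://github.com/savya08/T-REN.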
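
The abstract describes pooling patch-level representations within each semantic region into region tokens and aligning them with text. The NumPy sketch below illustrates only that general idea, using plain mean pooling and cosine similarity. It is not the authors' implementation: the function names, the use of a precomputed region-id map, and the toy shapes are all assumptions made for illustration, and the actual T-REN pooling network is learned rather than a fixed average.

```python
import numpy as np

def pool_region_tokens(patch_feats, region_ids, num_regions):
    """Mean-pool patch features into one token per region.

    patch_feats: (num_patches, dim) patch-level features (hypothetical input).
    region_ids:  (num_patches,) integer region index for each patch (assumed given).
    """
    dim = patch_feats.shape[1]
    sums = np.zeros((num_regions, dim))
    np.add.at(sums, region_ids, patch_feats)  # accumulate features per region
    counts = np.bincount(region_ids, minlength=num_regions).reshape(-1, 1)
    return sums / np.maximum(counts, 1)       # avoid division by zero for empty regions

def l2_normalize(x):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

def match_regions_to_text(region_tokens, text_embeds):
    """Assign each region token to its most similar text embedding (cosine similarity)."""
    sims = l2_normalize(region_tokens) @ l2_normalize(text_embeds).T
    return sims.argmax(axis=1), sims

# Toy example: a 16x16 patch grid split into 4 equally sized regions.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(256, 32))
region_ids = np.arange(256) // 64
region_tokens = pool_region_tokens(patch_feats, region_ids, num_regions=4)
token_reduction = patch_feats.shape[0] / region_tokens.shape[0]
```

In this toy setup 256 patch tokens collapse to 4 region tokens. The reduction factors reported in the abstract (more than 24x for images and 187x for videos) come from the paper's own experiments, not from this sketch.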
