Fugu-MT paper translation (abstract): Rethinking Token Reduction for Large Vision-Language Models

Paper overview: Rethinking Token Reduction for Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.21701v1
Date: Mon, 23 Mar 2026 08:40:08 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.573678
Title: Rethinking Token Reduction for Large Vision-Language Models
Authors: Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
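Abstract summary: Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but excessive visual tokens lead to high inference costs. To overcome the limitations of heuristic designs, the paper proposes a learning-based, prompt-agnostic token reduction method called MetaCompress.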
Reference score (independently computed interest score): 95.48478689025696
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
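The abstract describes token reduction as a compression mapping that covers both pruning and merging. One way to picture that unification is a k x n matrix applied to n visual tokens: one-hot rows give pruning, averaging rows give merging. The sketch below is only an illustration under that assumption. The matrix form and the helper names `compress`, `pruning_mapping`, and `merging_mapping` are hypothetical and are not taken from the paper or its code.

```python
# Illustrative sketch only: the matrix form and all helper names are assumptions,
# not the parameterization used by MetaCompress.

def compress(mapping, tokens):
    """Apply a k x n mapping to n tokens of dimension d (plain matrix product)."""
    dim = len(tokens[0])
    return [
        [sum(w * tok[j] for w, tok in zip(row, tokens)) for j in range(dim)]
        for row in mapping
    ]

def pruning_mapping(keep, n):
    """One-hot rows: each output token copies exactly one input token."""
    return [[1.0 if i == idx else 0.0 for i in range(n)] for idx in keep]

def merging_mapping(groups, n):
    """Averaging rows: each output token is the mean of one group of input tokens."""
    return [[1.0 / len(g) if i in g else 0.0 for i in range(n)] for g in groups]

# Four toy visual tokens with feature dimension 2.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

pruned = compress(pruning_mapping([0, 2], 4), tokens)            # keep tokens 0 and 2
merged = compress(merging_mapping([[0, 1], [2, 3]], 4), tokens)  # average adjacent pairs

print(pruned)
print(merged)
```

In this picture, a learned mapping would simply be a matrix whose rows are neither strictly one-hot nor strictly uniform averages. How the paper actually parameterizes and trains the mapping is not stated in the abstract.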
