Fugu-MT Paper Translation (Abstract): VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

Paper abstract: VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

arxiv url: http://arxiv.org/abs/2510.16598v1
Date: Sat, 18 Oct 2025 17:54:18 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.058016
Title: VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Title (reference translation): VisionSelector: Learnable Visual Token Compression for Efficient Multimodal LLMs
Authors: Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha
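Abstract summary: Multimodal Large Language Models (MLLMs) encounter computational and memory bottlenecks. Previous token compression techniques are often constrained by rules that risk discarding critical information. We propose a lightweight plug-and-play framework that reformulates token compression as an end-to-end learnable decision process.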
Reference score (independently computed attention score): 82.72388893596555
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we introduce a lightweight plug-and-play framework that reformulates token compression as an end-to-end learnable decision process. To be specific, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection at arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector generalizes across various compression rates and adaptively identifies critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with a 30% retention budget, outperforming prior methods by 12.14% at a 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector .
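Abstract (reference translation): Multimodal Large Language Models (MLLMs) encounter computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They suffer from biases such as attention sinks, which cause performance to drop sharply under aggressive compression ratios. To address these limitations, we introduce a lightweight plug-and-play framework that reformulates token compression as an end-to-end learnable decision process. Specifically, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a Top-K mechanism and a curriculum annealing strategy to bridge the gap between training and inference. With only 12.85M trainable parameters, VisionSelector generalizes across various compression rates and adaptively identifies critical tokens. This yields superior performance across all compression budgets, as evidenced by preserving 100% accuracy on MME at a 30% retention budget, outperforming prior methods by 12.14% at a 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector .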
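
The abstract names a differentiable Top-K mechanism and a curriculum annealing strategy but gives no formulas, so the following is only a rough illustration of the general idea, not VisionSelector's actual implementation. It assumes a sigmoid relaxation around a threshold placed between the k-th and (k+1)-th highest scores, and a geometric temperature schedule; both choices, along with every function name and default value (`soft_topk_mask`, `annealed_tau`, `budget_to_k`, `tau_start`, `tau_end`), are hypothetical. As the temperature shrinks, the soft mask used during training approaches the hard Top-K selection used at inference, which is one way a training-inference gap could be narrowed.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def budget_to_k(num_tokens, retention):
    """Tokens kept under a retention budget, e.g. 0.3 keeps about 30% (assumed rounding)."""
    return max(1, round(num_tokens * retention))


def soft_topk_mask(scores, k, tau):
    """Relaxed Top-K (assumed form): sigmoid around a threshold between ranks k and k+1."""
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    ranked = sorted(scores, reverse=True)
    threshold = 0.5 * (ranked[k - 1] + ranked[k])
    return [_sigmoid((s - threshold) / tau) for s in scores]


def hard_topk_indices(scores, k):
    """Inference-time selection: indices of the k highest scores, in original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def annealed_tau(step, total_steps, tau_start=1.0, tau_end=0.01):
    """Hypothetical curriculum: geometric decay from tau_start to tau_end."""
    frac = step / max(total_steps - 1, 1)
    return tau_start * (tau_end / tau_start) ** frac


if __name__ == "__main__":
    # Made-up importance scores for ten visual tokens.
    scores = [0.2, 1.5, -0.3, 0.9, 0.1, 2.0, -1.0, 0.4, 0.7, 1.1]
    k = budget_to_k(len(scores), 0.3)
    for step in range(5):
        tau = annealed_tau(step, 5)
        mask = soft_topk_mask(scores, k, tau)
        print(f"tau={tau:.3f}", [round(m, 2) for m in mask])
    print("hard selection:", hard_topk_indices(scores, k))
```

In a real training setup the mask would be computed on tensors inside an autodiff framework so that gradients reach the scorer; plain Python lists are used here only to keep the sketch dependency-free.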


Related papers