Fugu-MT Paper Translation (Abstract): Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Paper overview: Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

arxiv url: http://arxiv.org/abs/2510.07143v1
Date: Wed, 08 Oct 2025 15:44:28 GMT
Status: Translation complete
Updated in system: 2025-10-09 16:41:20.601057
Title: Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Title (reference translation): Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Authors: Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu
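Abstract summary: We show that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks.
Reference score (independently computed attention score): 54.4711434793961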
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent endeavors to accelerate inference in Multimodal Large Language Models (MLLMs) have primarily focused on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks are originally designed to assess the perception and reasoning capabilities of MLLMs, rather than to evaluate compression techniques. As a result, directly applying them to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Down-sampling is able to serve as a data filter to evaluate the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
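Abstract (reference translation): Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have focused mainly on visual token compression. The effectiveness of these methods is generally evaluated by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks were originally designed to assess the perception and reasoning capabilities of MLLMs, not to evaluate compression techniques. As a result, applying them directly to visual token compression creates a task mismatch. Notably, our investigation finds that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Downsampling can serve as a data filter for evaluating the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, enabling fairer and more accurate evaluation of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.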
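The abstract says that downsampling can act as a data filter for sample difficulty, but it does not spell out the filtering rule. The following is a minimal sketch of one plausible reading, not the VTC-Bench implementation: it assumes that a sample the downsampled-image baseline still answers correctly is "easy" for compression and that the remaining samples are "hard", then reports a method's accuracy on each group. All names here (`SampleResult`, `split_by_downsampling`, `filtered_report`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SampleResult:
    sample_id: str
    downsample_correct: bool  # assumed: baseline on the downsampled image was correct
    method_correct: bool      # assumed: compression method under test was correct


def split_by_downsampling(results: List[SampleResult]) -> Dict[str, List[SampleResult]]:
    """Partition samples by whether the downsampling baseline answered correctly."""
    easy = [r for r in results if r.downsample_correct]
    hard = [r for r in results if not r.downsample_correct]
    return {"easy": easy, "hard": hard}


def accuracy(results: List[SampleResult]) -> float:
    """Fraction of samples the method under test answered correctly (NaN if empty)."""
    if not results:
        return float("nan")
    return sum(r.method_correct for r in results) / len(results)


def filtered_report(results: List[SampleResult]) -> Dict[str, float]:
    """Accuracy on the full set and on each downsampling-defined group."""
    groups = split_by_downsampling(results)
    return {
        "all": accuracy(results),
        "easy": accuracy(groups["easy"]),
        "hard": accuracy(groups["hard"]),
    }


# Toy, made-up outcomes used only to exercise the functions.
toy_results = [
    SampleResult("a", downsample_correct=True, method_correct=True),
    SampleResult("b", downsample_correct=True, method_correct=False),
    SampleResult("c", downsample_correct=False, method_correct=True),
    SampleResult("d", downsample_correct=False, method_correct=True),
]
report = filtered_report(toy_results)
```

Under this assumption, the "hard" accuracy is the informative number: it isolates samples where naive downsampling fails, so a compression method cannot score well there merely by matching what downsampling already achieves.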

Related papers