Fugu-MT Paper Translation (Abstract): VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

Paper overview: VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

arxiv url: http://arxiv.org/abs/2602.04802v1
Date: Wed, 04 Feb 2026 17:48:55 GMT
Status: Translation complete
System update date: 2026-02-05 19:45:11.664371
Title: VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
Title (reference translation): VISTA-Bench: Is Visualized Text Understood as Well as Pure Text?
Authors: Qing'an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng, Yue Zhu, Haiwen Diao, Yunzhi Zhuge, Huchuan Lu
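Abstract summary: Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs. VISTA-Bench is a benchmark spanning multimodal perception, reasoning, and unimodal understanding domains.
Reference score (attention score computed independently by this site): 51.02924254085878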
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark from multimodal perception, reasoning, to unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.
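Abstract (reference translation): Vision-Language Models (VLMs) have achieved strong performance in cross-modal understanding of textual and visual inputs, but existing benchmarks focus mainly on pure-text queries. In real-world scenarios, language often appears as visualized text embedded in images, which raises the question of whether current VLMs handle such inputs comparably. VISTA-Bench is a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding domains. It evaluates visualized-text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. An extensive evaluation of more than 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when semantically equivalent content is presented as visualized text. The gap is further amplified as perceptual difficulty increases, highlighting sensitivity to rendering variations even though the semantics are unchanged. Overall, VISTA-Bench provides a principled evaluation framework for diagnosing this limitation and for guiding progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.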

Related papers