Fugu-MT Paper Translation (Summary): BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

Paper summary: BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

arxiv url: http://arxiv.org/abs/2510.09361v1
Date: Fri, 10 Oct 2025 13:14:13 GMT
Status: translation complete
Last updated in system: 2025-10-14 00:38:49.090011
Title: BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
Authors: Junyan Ye, Dongzhi Jiang, Jun He, Baichuan Zhou, Zilong Huang, Zhiyuan Yan, Hongsheng Li, Conghui He, Weijia Li
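Abstract summary: We introduce BLINK-Twice, a vision-centric reasoning benchmark. Instead of relying on external knowledge, its tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
Reference score (attention metric computed independently by this site): 67.89135437537179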
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception ("see") and requires fine-grained observation and analytical reasoning ("observe"). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space, such as chain-of-thought or self-criticism, can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice
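The abstract says that natural adversarial image pairs enforce reliance on visual content. One way to picture why pairing helps is a scoring rule that credits a pair only when every question in it is answered correctly, so a model that guesses from language priors cannot score on both images. The sketch below is illustrative only: the record layout, the function names, and the all-correct pair rule are assumptions made here, not the metric defined by the BLINK-Twice authors.

```python
from collections import defaultdict


def question_accuracy(records):
    """Fraction of individual questions answered correctly.

    `records` is a list of (pair_id, is_correct) tuples.
    This layout is hypothetical, not the dataset's actual schema.
    """
    if not records:
        return 0.0
    return sum(1 for _, ok in records if ok) / len(records)


def pair_accuracy(records):
    """Fraction of pairs in which every question was answered correctly.

    Assumed rule: a pair counts as solved only if all of its
    answers are correct.
    """
    by_pair = defaultdict(list)
    for pair_id, ok in records:
        by_pair[pair_id].append(bool(ok))
    if not by_pair:
        return 0.0
    solved = sum(1 for results in by_pair.values() if all(results))
    return solved / len(by_pair)


# Hypothetical results: pair "p1" fully solved, pair "p2" half solved.
example_records = [
    ("p1", True), ("p1", True),
    ("p2", True), ("p2", False),
]
print(question_accuracy(example_records))
print(pair_accuracy(example_records))
```

Under this assumed rule, the pair-level score can never exceed the per-question score, and the gap between them indicates how often a model gets one image of a pair right while failing its counterpart.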
