Fugu-MT Paper Translation (Abstract): CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

Paper overview: CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

arxiv url: http://arxiv.org/abs/2509.22737v1
Date: Thu, 25 Sep 2025 21:14:11 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:18.838042
Title: CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models
Title (reference translation): CompareBench: A benchmark for visual comparison reasoning in vision-language models
Authors: Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng
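Abstract summary: CompareBench is a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs). It consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100).
Reference score (independently computed attention score): 9.358625944204443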
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.
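Abstract (reference translation): We introduce CompareBench, a benchmark for visual comparison reasoning in vision-language models (VLMs). CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL). Even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These results show that visual comparison is a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.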
