Fugu-MT Paper Translation (Abstract): STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models

Paper overview: STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models

arxiv url: http://arxiv.org/abs/2510.22571v1
Date: Sun, 26 Oct 2025 08:04:28 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:15.249
Title: STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
Title (reference translation): STATUS Bench: A rigorous benchmark for evaluating object state understanding in vision-language models
Authors: Mahiro Ukai, Shuhei Kurita, Nakamasa Inoue
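Abstract summary: We introduce STATUS Bench, the first benchmark for rigorously evaluating the ability of vision-language models to understand subtle variations in object states. STATUS Bench requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). Furthermore, we introduce STATUS Train, a large-scale training dataset consisting of 13 million semi-automatically created descriptions.
Reference score (independently computed attention score): 28.438936778310865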
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their corresponding object state descriptions and state change descriptions. Furthermore, we introduce a large-scale training dataset, namely STATUS Train, which consists of 13 million semi-automatically created descriptions. This dataset serves as the largest resource to facilitate further research in this area. In our experiments, we demonstrate that STATUS Bench enables rigorous consistency evaluation and reveal that current state-of-the-art VLMs still significantly struggle to capture subtle object state distinctions. Surprisingly, under the proposed rigorous evaluation scheme, most open-weight VLMs exhibited chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.
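Abstract (reference translation): Object state recognition aims to identify the specific state of an object, such as its positional state (e.g., open or closed) or functional state (e.g., on or off). Recent Vision-Language Models (VLMs) can perform a variety of multimodal tasks, but it is unclear how accurately they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark that rigorously evaluates the ability of VLMs to understand subtle variations in object states across diverse situations. Specifically, STATUS Bench introduces a new evaluation scheme that requires performing three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined on a fully hand-crafted dataset containing image pairs, the corresponding object state descriptions, and state change descriptions. Furthermore, we introduce STATUS Train, a large-scale training dataset consisting of 13 million semi-automatically created descriptions. This dataset is the largest resource for facilitating further research in this area. Our experiments show that STATUS Bench enables rigorous consistency evaluation and that current state-of-the-art VLMs struggle to capture subtle object state distinctions. Surprisingly, under the rigorous evaluation scheme, most open-weight VLMs showed chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.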
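
Illustrative note: the abstract says the evaluation scheme requires a VLM to perform OSI, IR, and SCI simultaneously, but it does not give the scoring rule. The sketch below shows one possible reading, in which an image pair counts as correct only if all three task outcomes for that pair are correct. The `PairResult` record and both function names are hypothetical and are not taken from the paper or any released code.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PairResult:
    """Hypothetical per-image-pair outcome for the three tasks."""
    osi_correct: bool  # object state identification
    ir_correct: bool   # image retrieval
    sci_correct: bool  # state change identification


def per_task_accuracy(results: Sequence[PairResult]) -> Dict[str, float]:
    """Accuracy of each task scored independently."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "OSI": sum(r.osi_correct for r in results) / n,
        "IR": sum(r.ir_correct for r in results) / n,
        "SCI": sum(r.sci_correct for r in results) / n,
    }


def joint_accuracy(results: Sequence[PairResult]) -> float:
    """Fraction of pairs where all three tasks are correct (assumed rule)."""
    if not results:
        raise ValueError("results must not be empty")
    all_correct = sum(
        r.osi_correct and r.ir_correct and r.sci_correct for r in results
    )
    return all_correct / len(results)


if __name__ == "__main__":
    demo = [
        PairResult(True, True, True),
        PairResult(True, False, True),
        PairResult(True, True, False),
        PairResult(False, True, True),
    ]
    print(per_task_accuracy(demo))
    print(joint_accuracy(demo))
```

Under this assumed rule, a model can score well on each task separately while its joint score stays low. That is one way a combined evaluation can expose inconsistent answers that per-task accuracy hides.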

Related papers