Fugu-MT Paper Translation (Abstract): Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

Paper overview: Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

arxiv url: http://arxiv.org/abs/2508.18753v1
Date: Tue, 26 Aug 2025 07:30:53 GMT
Status: Translation complete
Last updated in system: 2025-08-27 17:42:38.727986
Title: Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods
Title (reference translation): Rethinking the evaluation of human-object interaction for both vision-language models and HOI-specific methods
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
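Abstract summary: This paper proposes a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task. The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods.
Reference score (attention score computed independently by this site): 33.074167753966314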
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
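Abstract (reference translation): Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models already have a strong ability to understand images involving HOI. Can general-purpose standalone VLMs solve HOI detection effectively? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can be interpreted as either "throwing" or "catching". When "catching" is annotated, the other interpretation, though equally applicable to the image, is marked incorrect under exact matching. As a result, correct predictions may be penalized, affecting both VLMs and HOI-specific methods. The paper introduces a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task. Each question includes only ground-truth positive options and a curated set of negatives constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative, so that valid predictions are not penalized). The proposed evaluation protocol enables direct comparison of VLMs and HOI methods and offers new insight into the current state of progress in HOI understanding.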
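
The multiple-answer multiple-choice setup described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the `AMBIGUOUS_WITH` table, the `build_question` and `score` helpers, and the precision/recall scoring are hypothetical and are not taken from the paper, whose abstract does not state how negatives are curated or which metric is used.

```python
from dataclasses import dataclass

# Hypothetical table of verbs that can plausibly describe the same still image.
# The paper's actual curation procedure is not given in the abstract.
AMBIGUOUS_WITH = {
    "catching": {"throwing"},
    "throwing": {"catching"},
}


@dataclass(frozen=True)
class Question:
    positives: frozenset  # annotated (ground-truth) interactions
    negatives: frozenset  # curated distractors

    @property
    def options(self):
        """All answer options shown to the model, in a stable order."""
        return sorted(self.positives | self.negatives)


def build_question(annotated, candidate_verbs):
    """Keep annotated verbs as positives; drop ambiguous verbs from negatives."""
    positives = frozenset(annotated)
    excluded = set(positives)
    for verb in positives:
        excluded |= AMBIGUOUS_WITH.get(verb, set())
    negatives = frozenset(v for v in candidate_verbs if v not in excluded)
    return Question(positives, negatives)


def score(question, selected):
    """Assumed metric: (precision, recall) of the selected options.

    Selections that are not among the question's options are ignored.
    """
    chosen = set(selected) & (question.positives | question.negatives)
    hits = len(chosen & question.positives)
    precision = hits / len(chosen) if chosen else 0.0
    recall = hits / len(question.positives) if question.positives else 0.0
    return precision, recall


if __name__ == "__main__":
    q = build_question({"catching"}, ["catching", "throwing", "eating", "washing"])
    print(q.options)                       # ['catching', 'eating', 'washing']
    print(score(q, {"catching"}))          # (1.0, 1.0)
    print(score(q, {"catching", "eating"}))  # (0.5, 1.0)
```

In this sketch "throwing" never appears as an option when "catching" is annotated, so a model cannot be marked wrong for an interpretation that the image equally supports.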


Related papers