Fugu-MT paper translation (summary): SHOE: Semantic HOI Open-Vocabulary Evaluation Metric

Paper summary: SHOE: Semantic HOI Open-Vocabulary Evaluation Metric

arxiv url: http://arxiv.org/abs/2604.01586v1
Date: Thu, 02 Apr 2026 03:53:39 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.216596
Title: SHOE: Semantic HOI Open-Vocabulary Evaluation Metric
Authors: Maja Noack, Qinqian Lei, Taipeng Tian, Bihan Dong, Robby T. Tan, Yixin Chen, John Young, Saijun Zhang, Bo Wang
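Abstract summary: This paper introduces SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework. SHOE incorporates the semantic similarity between predicted and ground-truth HOI labels. The results show that SHOE scores align more closely with human judgments than existing metrics.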
Reference score (independently computed attention score): 28.578980275126707
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions. We will release our evaluation metric to the public to facilitate future research.
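
The scoring idea in the abstract (split each HOI label into verb and object, average similarity estimates from several models, then combine them) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the table-based scorers stand in for LLM calls, the similarity values are invented for the example, and combining verb and object similarity by multiplication is an assumption, since the abstract does not state the combination rule. All function names here are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# A scorer maps a pair of words or phrases to a similarity in [0, 1].
Scorer = Callable[[str, str], float]
HOI = Tuple[str, str]  # (verb, object)


def make_table_scorer(table: Dict[Tuple[str, str], float],
                      default: float = 0.0) -> Scorer:
    """Build a stand-in scorer from a lookup table (placeholder for an LLM call)."""
    def score(a: str, b: str) -> float:
        if a == b:
            return 1.0
        # Similarity is treated as symmetric, so check both orders.
        return table.get((a, b), table.get((b, a), default))
    return score


def averaged_similarity(a: str, b: str, scorers: Sequence[Scorer]) -> float:
    """Average the similarity estimates of several scorers."""
    return mean(s(a, b) for s in scorers)


def hoi_similarity(pred: HOI, truth: HOI, scorers: Sequence[Scorer]) -> float:
    """Score a predicted (verb, object) pair against a ground-truth pair."""
    verb_sim = averaged_similarity(pred[0], truth[0], scorers)
    obj_sim = averaged_similarity(pred[1], truth[1], scorers)
    # Assumption: combine by product; the abstract does not give the rule.
    return verb_sim * obj_sim


# Invented similarity values, for illustration only.
scorer_a = make_table_scorer({("lean on", "sit on"): 0.8})
scorer_b = make_table_scorer({("lean on", "sit on"): 0.6})
scorers = [scorer_a, scorer_b]

exact = hoi_similarity(("sit on", "couch"), ("sit on", "couch"), scorers)
near = hoi_similarity(("lean on", "couch"), ("sit on", "couch"), scorers)
print(exact, near)
```

With these placeholder values an exact match scores 1.0, while "lean on couch" against "sit on couch" receives partial credit of roughly 0.7, where an exact-match metric would give it none.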

Related papers