Fugu-MT Paper Translation (Summary): PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Paper summary: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

arxiv url: http://arxiv.org/abs/2510.19060v1
Date: Tue, 21 Oct 2025 20:30:20 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:14.63225
Title: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Title (reference translation): PoSh: Using Scene Graphs to Guide LLMs-as-a-Judge for Detailed Image Descriptions
Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
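Abstract summary: PoSh is a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge. PoSh is replicable, interpretable, and a better proxy for human raters than existing metrics. We show that PoSh correlates more strongly with the human judgments in DOCENT than the best open-weight alternatives.
Reference score (independently computed attention score): 55.95282725491425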
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
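Abstract (reference translation): Vision-language models (VLMs) have advanced to detailed image description, but evaluation remains a challenge. Standard metrics (e.g., CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. Long texts, by contrast, require sensitivity to attribute and relation attachments, along with scores that localize errors to particular text spans. This paper introduces PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge. PoSh is replicable, interpretable, and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce DOCENT, a challenging new dataset. This new benchmark contains artwork paired with expert-written references and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. DOCENT thus enables evaluation of both detailed image description metrics and detailed image description itself in a challenging new domain. PoSh correlates more strongly with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (shown using CapArena, an existing dataset of web imagery), and is a capable reward function that outperforms standard supervised fine-tuning. Using PoSh, we then characterize how open and closed models perform in describing the paintings, sketches, and statues in DOCENT, and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, which establishes a demanding new task for gauging VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.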
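
The abstract reports agreement between a metric and human raters as a Spearman rank correlation. As a rough illustration of what that statistic measures, the sketch below computes Spearman's rho in plain Python as the Pearson correlation of average ranks. It is not the paper's evaluation code: the function names `average_ranks` and `spearman_rho` are hypothetical, and the scores are made-up numbers, not values from DOCENT.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Illustrative, made-up scores (assumption): one metric score and one
# human rating per generated description.
metric_scores = [0.62, 0.81, 0.40, 0.95, 0.55]
human_scores = [3, 4, 2, 5, 1]
rho = spearman_rho(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because rho depends only on rank order, a metric can score well under it even if its raw values sit on a different scale from the human ratings; what matters is whether it orders descriptions the way the raters do.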

Related papers