Fugu-MT paper translation (abstract): Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Paper overview: Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

arxiv url: http://arxiv.org/abs/2603.25004v1
Date: Thu, 26 Mar 2026 04:05:30 GMT
Status: translation complete
Last updated in system: 2026-03-27 20:52:48.087481
Title: Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs
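Title (reference translation): Interpretation of zero-shot referring expressions using query-driven scene graphs
Authors: Yike Wu, Necva Bolucu, Stephen Wan, Dadong Wang, Jiahao Xia, Jian Zhang
Abstract summary: Zero-shot referring expression comprehension (REC) aims to locate the target object in an image given a natural language query. Existing vision-language models (VLMs) address zero-shot REC by measuring the feature similarity between text queries and image regions. We propose SGREC, an interpretable zero-shot REC method that uses query-driven scene graphs as structured intermediaries.
Reference score (independently computed attention score): 18.414159451507153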
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.
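Abstract (reference translation): Zero-shot referring expression comprehension (REC) aims to find the target object in an image for a given natural language query without relying on task-specific training data, and it demands strong visual understanding capabilities. Existing Vision-Language Models (VLMs) such as CLIP address zero-shot REC by directly measuring the feature similarity between text queries and image regions. However, these methods struggle to capture fine visual details and to understand complex object relationships. Large Language Models (LLMs), on the other hand, excel at high-level semantic reasoning, but because they cannot directly abstract visual features into textual semantics, their use in REC tasks is limited. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method that uses query-driven scene graphs as structured intermediaries. Specifically, we first use a VLM to build a query-driven scene graph that explicitly encodes the spatial relationships, descriptive captions, and object interactions relevant to the given query. Using this scene graph closes the gap between low-level image regions and the high-level semantic understanding that LLMs require. Finally, the LLM infers the target object from the structured textual representation provided by the scene graph and responds with a detailed explanation of its decision, which ensures interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%).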
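
The two-stage idea described in the abstract (a VLM builds a query-driven scene graph, then an LLM reasons over its textual form) can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `SceneObject`, `Relation`, and `SceneGraph` classes, the `graph_to_prompt` function, the prompt wording, and the example boxes are all hypothetical. The sketch only shows how spatial relations, captions, and object interactions might be serialized into text that an LLM could read; it does not call any VLM or LLM.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One candidate region (hypothetical schema, not taken from the paper)."""
    obj_id: int
    label: str
    box: tuple  # assumed format: (x1, y1, x2, y2) in pixels
    caption: str = ""


@dataclass
class Relation:
    """A directed edge: subject --predicate--> object (hypothetical schema)."""
    subject_id: int
    predicate: str
    object_id: int


@dataclass
class SceneGraph:
    """Objects and relations judged relevant to one query."""
    query: str
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def graph_to_prompt(graph: SceneGraph) -> str:
    """Serialize the graph into plain text for a language model to read."""
    labels = {obj.obj_id: obj.label for obj in graph.objects}
    lines = [f"Query: {graph.query}", "Objects:"]
    for obj in graph.objects:
        lines.append(
            f"  [{obj.obj_id}] {obj.label} box={obj.box} caption: {obj.caption}"
        )
    lines.append("Relations:")
    for rel in graph.relations:
        lines.append(
            f"  [{rel.subject_id}] {labels[rel.subject_id]} {rel.predicate} "
            f"[{rel.object_id}] {labels[rel.object_id]}"
        )
    # Asking for an explanation mirrors the interpretability goal in the abstract.
    lines.append("Answer with the id of the object the query refers to, and explain why.")
    return "\n".join(lines)


# Made-up example data; in the described method a VLM would produce this.
example = SceneGraph(
    query="the dog left of the bench",
    objects=[
        SceneObject(0, "dog", (10, 40, 120, 200), "a brown dog sitting on grass"),
        SceneObject(1, "dog", (300, 50, 420, 210), "a white dog running"),
        SceneObject(2, "bench", (150, 30, 280, 220), "a wooden park bench"),
    ],
    relations=[
        Relation(0, "left of", 2),
        Relation(1, "right of", 2),
    ],
)

prompt = graph_to_prompt(example)
print(prompt)
```

Keeping the graph as explicit objects and relations, rather than a single similarity score, is what would let the final answer be traced back to a stated relation such as "[0] dog left of [2] bench".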

Related papers