Fugu-MT Paper Translation (Abstract): SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

Paper abstract: SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

arxiv url: http://arxiv.org/abs/2603.07961v2
Date: Thu, 12 Mar 2026 07:26:53 GMT
Status: Translation complete
System update date: 2026-03-13 14:46:25.432381
Title: SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation
Title (reference translation): SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation
Authors: Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li
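Abstract summary: Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. SGG-R$^{\rm 3}$ is a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on two benchmarks show that SGG-R$^{\rm 3}$ outperforms existing methods.
Reference score (independently computed attention score): 8.542770965458821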
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM and is refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
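Abstract (reference translation): Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. Although Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered both by a lack of task-specific structured reasoning and by the challenges of sparse, long-tailed relation distributions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO) to achieve end-to-end scene graph generation. In the SFT phase, we propose a relation augmentation strategy that leverages an MLLM. A stage-aligned reward scheme then optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ outperforms existing methods, demonstrating the effectiveness and generalization of the framework.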
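
The abstract names a dual-granularity reward with frequency-based predicate weighting and semantic clustering but gives no formulas. The following is only an illustrative sketch of how such a reward could be put together, not the authors' method. The inverse-frequency form, the exponent `alpha`, the mixing factor `mix`, the `cluster_of` mapping, and both function names are assumptions made for this example.

```python
from collections import Counter


def predicate_weights(train_predicates, alpha=0.5):
    """Hypothetical inverse-frequency weights: rarer predicates weigh more.

    The weights are scaled so that the rarest predicate has weight 1.0.
    """
    counts = Counter(train_predicates)
    total = sum(counts.values())
    raw = {p: (total / c) ** alpha for p, c in counts.items()}
    max_w = max(raw.values())
    return {p: w / max_w for p, w in raw.items()}


def dual_granularity_reward(pred, gold, weights, cluster_of, mix=0.5):
    """Blend a fine-grained and a coarse-grained relation reward (assumed form).

    pred, gold: sets of (subject, predicate, object) triplets.
    cluster_of: assumed mapping from predicate to a semantic cluster label.
    """
    if not gold:
        return 0.0
    # Fine-grained: frequency-weighted recall over exactly matched triplets.
    total_w = sum(weights.get(p, 1.0) for _, p, _ in gold)
    hit_w = sum(weights.get(p, 1.0) for (s, p, o) in gold if (s, p, o) in pred)
    fine = hit_w / total_w
    # Coarse-grained: a predicate counts if it falls in the same cluster.
    pred_coarse = {(s, cluster_of.get(p, p), o) for s, p, o in pred}
    hits = sum((s, cluster_of.get(p, p), o) in pred_coarse for s, p, o in gold)
    coarse = hits / len(gold)
    return mix * fine + (1 - mix) * coarse
```

In this sketch a prediction such as "sitting on" in place of the rarer gold predicate "riding" earns no fine-grained credit but still earns coarse-grained credit if both predicates share a cluster, which is one plausible reading of how coverage and long-tail weighting could interact.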

Related papers