Fugu-MT Paper Translation (Abstract): SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

Paper summary: SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.13667v1
Date: Wed, 13 May 2026 15:27:41 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:28.132204
Title: SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
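Title (reference translation): SceneGraphVLM: Dynamic Scene Graph Generation from Video Using Vision-Language Models
Authors: Vladislav Makarov, Mark Gizetdinov, Dmitry Yudin
Abstract summary: SceneGraphVLM is a compact method for image and video scene graph generation using small vision-language models. It serializes graphs in the token-efficient TOON format and trains the model in two stages. SceneGraphVLM is evaluated on PSG, PVSG, and Action Genome.
Reference score (independently computed attention score): 0.25489046505746704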
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.
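Abstract (reference translation): Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains difficult. Recent VLM-based methods can generate scene graphs end-to-end as structured text, but they often produce long outputs containing irrelevant objects and relations. SceneGraphVLM is a compact method for image and video scene graph generation using small vision-language models. SceneGraphVLM serializes graphs in the token-efficient TOON format and trains the model in two stages: supervised fine-tuning, followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. SceneGraphVLM is evaluated on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while maintaining reasonable recall, and generates complete scene graphs with a latency of approximately one second. Code and implementation details are available at https://github.com/markus0440/SceneGraphVLM.git.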
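
The abstract describes a reward that balances relation coverage and precision while penalizing unsupported objects and relations, but it does not give the formula. The following is only an illustrative sketch of how such a reward might be shaped, not the authors' implementation. The function name `graph_reward`, the F-beta combination, the penalty weights, and the definition of an "unsupported" relation (a predicted subject-object pair that is not related at all in the reference graph) are all assumptions made for this example.

```python
from typing import Set, Tuple

# A relation triplet: (subject, predicate, object). Hypothetical representation.
Triplet = Tuple[str, str, str]


def graph_reward(pred: Set[Triplet], gold: Set[Triplet],
                 beta: float = 1.0,
                 obj_penalty: float = 0.5,
                 rel_penalty: float = 0.5) -> float:
    """Illustrative hallucination-aware reward (assumed form, not from the paper)."""
    if not pred:
        return 0.0  # an empty graph earns nothing in this sketch

    # Coverage (recall) and precision over exact triplet matches.
    hits = pred & gold
    precision = len(hits) / len(pred)
    recall = len(hits) / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_beta = 0.0
    else:
        b2 = beta * beta
        f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)

    # Share of predicted objects that never appear in the reference graph.
    gold_objects = {name for s, _, o in gold for name in (s, o)}
    pred_objects = {name for s, _, o in pred for name in (s, o)}
    obj_halluc = len(pred_objects - gold_objects) / len(pred_objects)

    # Share of predicted relations whose subject-object pair is unrelated
    # in the reference graph (the assumed meaning of "unsupported relation").
    gold_pairs = {(s, o) for s, _, o in gold}
    unsupported = [t for t in pred if (t[0], t[2]) not in gold_pairs]
    rel_halluc = len(unsupported) / len(pred)

    return f_beta - obj_penalty * obj_halluc - rel_penalty * rel_halluc
```

In this sketch a prediction that exactly matches the reference graph scores 1.0, and each invented object or relation lowers the score, so a policy optimized against it would be pushed toward shorter, better-supported graphs. The actual reward used in the paper may differ in every one of these choices.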


Related papers