Fugu-MT Paper Translation (Summary): SATURN: Autoregressive Image Generation Guided by Scene Graphs

Paper summary: SATURN: Autoregressive Image Generation Guided by Scene Graphs

arxiv url: http://arxiv.org/abs/2508.14502v1
Date: Wed, 20 Aug 2025 07:45:08 GMT
Status: Translation complete
Last updated in system: 2025-08-21 16:52:41.381
Title: SATURN: Autoregressive Image Generation Guided by Scene Graphs
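Title (reference translation): SATURN: Autoregressive Image Generation Guided by Scene Graphs
Authors: Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, Minh-Triet Tran
Abstract summary: This paper introduces SATURN, a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence. On the Visual Genome dataset, SATURN reduces FID from 56.45% to 21.62% and raises the Inception Score from 16.03 to 24.78. The results show that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.
Reference score (independently computed attention score): 12.322079280436888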
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45% to 21.62% and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring extra modules or multi-stage training. Qualitative results further confirm improvements in object count fidelity and spatial relation accuracy, showing that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.
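Abstract (reference translation): State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. This paper introduces SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), which enables a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45% to 21.62% and raises the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring extra modules or multi-stage training. The results show that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.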
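
The abstract describes translating a scene graph into a salience-ordered token sequence but does not say how salience is computed or what the sequence looks like. The sketch below is only an illustration of the general idea under stated assumptions: the `Triplet` class, the numeric `salience` field, the `linearize` helper, and the comma-separated text format are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One scene-graph edge: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str
    salience: float  # hypothetical score; the abstract does not define it


def linearize(triplets, sep=", "):
    """Order triplets from most to least salient and join them as text.

    Assumption: the ordered text would then be passed to a text encoder.
    This is a sketch of the idea, not the authors' implementation.
    """
    ordered = sorted(triplets, key=lambda t: t.salience, reverse=True)
    return sep.join(f"{t.subject} {t.predicate} {t.obj}" for t in ordered)


# Toy scene graph with made-up salience values.
graph = [
    Triplet("man", "riding", "horse", 0.9),
    Triplet("tree", "behind", "horse", 0.3),
    Triplet("horse", "on", "grass", 0.6),
]

print(linearize(graph))
```

In this toy example the most salient relation comes first, so a model reading the sequence left to right would see the dominant objects before background detail. Whether SATURN uses this ordering criterion or this serialization is not stated in the abstract.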

Related papers