Fugu-MT paper translation (abstract): Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

Paper abstract: Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

arxiv url: http://arxiv.org/abs/2604.20730v1
Date: Wed, 22 Apr 2026 16:15:28 GMT
Status: translation complete
Last updated in system: 2026-04-23 15:36:11.222313
Title: Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback
Authors: Guotao Liang, Zhangcheng Wang, Juncheng Hu, Haitao Zhou, Ziteng Xue, Jing Zhang, Dong Xu, Qian Yu
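Abstract summary: This paper proposes a new generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code onto a cumulative canvas, the model explicitly observes the evolving visual context at each step. The authors show that applying this visual loop naively to off-the-shelf models is suboptimal, because those models cannot leverage incremental visual-code mappings.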
Reference score (attention score computed by this site): 29.19392406217364
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.
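The step-wise loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The grid rasterizer `render`, the `propose_next` callable (a stand-in for the multimodal model), the restriction to axis-aligned rectangles, and the "canvas unchanged" check used as the verify step are all assumptions made for illustration; the abstract does not specify how Render-and-Verify decides that a primitive is degenerate or redundant.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical primitive type: an axis-aligned rectangle (x, y, width, height, fill).
Rect = Tuple[int, int, int, int, str]
Canvas = List[List[str]]


def render(prims: List[Rect], size: int = 8, background: str = "none") -> Canvas:
    """Rasterize primitives in order onto a size x size grid (later ones occlude earlier ones)."""
    canvas = [[background] * size for _ in range(size)]
    for x, y, w, h, fill in prims:
        for row in range(max(y, 0), min(y + h, size)):
            for col in range(max(x, 0), min(x + w, size)):
                canvas[row][col] = fill
    return canvas


def render_in_the_loop(
    propose_next: Callable[[Canvas, List[Rect]], Optional[Rect]],
    max_steps: int = 16,
    size: int = 8,
) -> List[Rect]:
    """Generate primitives one at a time, re-rendering the cumulative canvas after each step."""
    accepted: List[Rect] = []
    canvas = render(accepted, size)
    for _ in range(max_steps):
        # The proposer sees the current canvas, not only the code so far.
        candidate = propose_next(canvas, accepted)
        if candidate is None:  # proposer signals that the drawing is finished
            break
        new_canvas = render(accepted + [candidate], size)
        # Assumed verify rule: reject a primitive that leaves the canvas unchanged,
        # which covers zero-area (degenerate) and exact-repeat (redundant) cases.
        if new_canvas == canvas:
            continue
        accepted.append(candidate)
        canvas = new_canvas
    return accepted


def to_svg(prims: List[Rect], size: int = 8) -> str:
    """Serialize the accepted primitives as an SVG document string."""
    body = "".join(
        f'<rect x="{x}" y="{y}" width="{w}" height="{h}" fill="{fill}"/>'
        for x, y, w, h, fill in prims
    )
    return f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">{body}</svg>'


def scripted_proposer(script: List[Rect]) -> Callable[[Canvas, List[Rect]], Optional[Rect]]:
    """Stand-in for a model: replays a fixed list of candidates and ignores the canvas."""
    remaining = iter(script)
    return lambda canvas, accepted: next(remaining, None)


script = [
    (0, 0, 8, 8, "blue"),   # background
    (2, 2, 4, 4, "red"),    # foreground square
    (2, 2, 4, 4, "red"),    # exact repeat: no visible change
    (5, 5, 0, 3, "green"),  # zero width: draws nothing
]
result = render_in_the_loop(scripted_proposer(script))
print(to_svg(result))
```

In this sketch only the first two rectangles survive, because the last two leave the rendered canvas unchanged. A real system would replace `scripted_proposer` with a model that actually conditions on the rendered canvas, and would rasterize full SVG paths rather than rectangles on a character grid.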

Related papers