Fugu-MT Paper Translation (Abstract): Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

Paper summary: Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

arxiv url: http://arxiv.org/abs/2603.22187v1
Date: Mon, 23 Mar 2026 16:48:39 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.794192
Title: Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Title (reference translation): Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Authors: Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie
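Abstract summary: Visual Feedback Layout Model (VFLM) is a self-improving framework that uses visual feedback to refine layouts iteratively. It consistently outperforms advanced MLLMs, existing layout models, and code-only baselines.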
Reference score (independently computed attention score): 46.546443161594304
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
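Abstract (reference translation): Recent progress in Multimodal Large Language Models (MLLMs) has made it possible to generate structured layouts automatically from natural language descriptions. Existing methods usually follow a code-only paradigm: they generate code that represents a layout, and a graphics engine renders that code into the final image. Because these methods never see the rendered result, readability and aesthetics are hard to guarantee. This paper identifies visual feedback as a key factor in layout generation and proposes the Visual Feedback Layout Model (VFLM), a self-improving framework that uses visual feedback for iterative refinement. VFLM performs adaptive reflective generation: it uses visual information to reflect on earlier issues and keeps generating outputs until the quality is satisfactory. This behavior is obtained through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. Rewarding only the final outcome effectively stimulates the model's iterative, reflective generation ability. Experiments on multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. The code and data are available at https://github.com/FolSpark/VFLM.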
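
Illustrative sketch: The render, inspect, revise loop and the final-outcome-only reward described in the abstract can be pictured with a toy example. This is a minimal sketch under stated assumptions, not the VFLM implementation: the Layout class, the width formula, the truncation-based stand-in for OCR, and every function name below are hypothetical, and a simple rule replaces the MLLM that would normally reflect on the rendered image.

```python
from dataclasses import dataclass, replace
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass(frozen=True)
class Layout:
    """Hypothetical layout: one text line in a fixed-width box (pixels)."""
    text: str
    font_size: int
    box_width: int


def rendered_width(layout: Layout) -> int:
    # Toy renderer (assumption): each glyph is 0.6 * font_size pixels wide.
    return len(layout.text) * layout.font_size * 6 // 10


def visible_text(layout: Layout) -> str:
    # Toy OCR (assumption): only characters that fit in the box are readable.
    fits = layout.box_width * 10 // (layout.font_size * 6)
    return layout.text[:fits]


def visual_issue(layout: Layout) -> Optional[str]:
    # Stand-in for visual feedback: report a problem seen in the render.
    return "overflow" if rendered_width(layout) > layout.box_width else None


def revise(layout: Layout, issue: str) -> Layout:
    # Stand-in for reflective generation: shrink the font on overflow.
    if issue == "overflow":
        return replace(layout, font_size=max(layout.font_size - 2, 2))
    return layout


def refine(layout: Layout, max_rounds: int = 10) -> List[Layout]:
    """Iterate until no issue is visible or the round budget runs out."""
    trajectory = [layout]
    for _ in range(max_rounds):
        issue = visual_issue(layout)
        if issue is None:
            break
        layout = revise(layout, issue)
        trajectory.append(layout)
    return trajectory


def ocr_accuracy(layout: Layout) -> float:
    # Similarity between the intended text and what the toy OCR reads.
    return SequenceMatcher(None, layout.text, visible_text(layout)).ratio()


def outcome_reward(trajectory: List[Layout]) -> float:
    # Only the final layout is scored; intermediate attempts earn nothing.
    return ocr_accuracy(trajectory[-1])


start = Layout("Seeing is Improving", font_size=40, box_width=300)
trajectory = refine(start)
print([step.font_size for step in trajectory], outcome_reward(trajectory))
```

Scoring only the last element of the trajectory means early, flawed attempts are not penalized directly; in this toy setup, the only way to raise the reward is to keep revising until the rendered text becomes fully readable.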


Related papers