Fugu-MT Paper Translation (Abstract): Iterative Refinement Improves Compositional Image Generation

Paper summary: Iterative Refinement Improves Compositional Image Generation

arxiv url: http://arxiv.org/abs/2601.15286v1
Date: Wed, 21 Jan 2026 18:59:40 GMT
Status: Translation complete
Last updated in system: 2026-01-22 21:27:50.511379
Title: Iterative Refinement Improves Compositional Image Generation
Title (reference translation): Iterative Refinement Improves Compositional Image Generation
Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
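Abstract summary: Text-to-image (T2I) models struggle with complex prompts that require handling multiple objects, relations, and attributes simultaneously. This paper proposes an iterative test-time strategy in which a T2I model progressively refines its generations over multiple steps. The approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models.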
Reference score (independently computed attention level): 47.116050084875106
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
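Abstract (reference translation): Text-to-image (T2I) models have made remarkable progress, yet they continue to struggle with complex prompts that require handling multiple objects, relations, and attributes simultaneously. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing the number of denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model acting as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we show consistent gains in image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category), and a 12.5% improvement on Visual Jenga scene decomposition, compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections. These findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/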
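
The abstract contrasts a critic-guided refinement loop with parallel sampling under the same compute budget. The sketch below is only an illustration of that contrast, not the authors' implementation. The callable interfaces, the score threshold, the budget counted in generator calls, and the toy generator and critic are all assumptions made for this example. The toy pair is deliberately constructed so that feedback always helps, so it says nothing about real-model results.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions for illustration, not from the paper):
#   generate(prompt, previous_image, feedback) -> image
#   critique(prompt, image) -> (score in [0, 1], textual feedback)
Image = str  # placeholder; a real generator would return pixel data
Generator = Callable[[str, Optional[Image], Optional[str]], Image]
Critic = Callable[[str, Image], Tuple[float, str]]


def iterative_refine(prompt: str, generate: Generator, critique: Critic,
                     budget: int, threshold: float = 1.0):
    """Refine one image step by step, feeding critic feedback back in."""
    image = generate(prompt, None, None)
    score, feedback = critique(prompt, image)
    best_score, best_image = score, image
    calls = 1  # number of generator calls used so far
    while calls < budget and score < threshold:
        image = generate(prompt, image, feedback)
        score, feedback = critique(prompt, image)
        calls += 1
        if score > best_score:
            best_score, best_image = score, image
    return best_image, best_score, calls


def parallel_best_of_n(prompt: str, generate: Generator, critique: Critic,
                       budget: int):
    """Baseline: sample `budget` independent images, keep the best-scored."""
    candidates = [generate(prompt, None, None) for _ in range(budget)]
    scored = [(critique(prompt, img)[0], img) for img in candidates]
    best_score, best_image = max(scored, key=lambda pair: pair[0])
    return best_image, best_score, budget


# Toy stand-ins: a "prompt" is a comma-separated list of constraints and an
# "image" is the comma-separated list of constraints it satisfies.
def toy_critique(prompt: str, image: Image) -> Tuple[float, str]:
    wanted = prompt.split(",")
    present = set(image.split(","))
    missing = [c for c in wanted if c not in present]
    score = (len(wanted) - len(missing)) / len(wanted)
    return score, (missing[0] if missing else "")


def toy_generate(prompt: str, previous: Optional[Image],
                 feedback: Optional[str]) -> Image:
    if previous is None or not feedback:
        return prompt.split(",")[0]  # first attempt satisfies one constraint
    return previous + "," + feedback  # a refinement fixes the reported issue


if __name__ == "__main__":
    p = "red cube,blue sphere,cube left of sphere"
    print(iterative_refine(p, toy_generate, toy_critique, budget=4))
    print(parallel_best_of_n(p, toy_generate, toy_critique, budget=4))
```

In this sketch both strategies are capped at the same number of generator calls, which is one simple way to read "compute-matched"; the paper may define the match differently.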

Related papers