Fugu-MT Paper Translation (Abstract): Iterative Prompt Refinement for Safer Text-to-Image Generation

Paper overview: Iterative Prompt Refinement for Safer Text-to-Image Generation

arxiv url: http://arxiv.org/abs/2509.13760v1
Date: Wed, 17 Sep 2025 07:16:06 GMT
Status: Translation complete
Last updated in system: 2025-09-18 18:41:50.754505
Title: Iterative Prompt Refinement for Safer Text-to-Image Generation
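Title (reference translation): Iterative Prompt Correction for Text-to-Image Generation
Authors: Jinwoo Jeon, JunHyeok Oh, Hayeong Lee, Byung-Jun Lee
Abstract summary: Existing safety methods typically refine prompts using large language models (LLMs). This paper proposes an iterative prompt refinement algorithm that uses vision language models (VLMs) to analyze both the input prompt and the generated images. The proposed method offers a practical solution for generating safer T2I content without compromising alignment with user intent.
Reference score (independently computed attention score): 4.174845397893041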
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at https://github.com/ku-dmlab/IPR. WARNING: This paper contains examples of harmful or inappropriate images generated by models.
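Abstract (reference translation): Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety depend heavily on how the prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but because they overlook the generated images, they can produce unsafe outputs or make unnecessary changes to prompts that are already safe. This work therefore proposes an iterative prompt refinement algorithm that uses vision language models (VLMs) to analyze both the input prompt and the generated images. By leveraging visual feedback, the method improves safety while maintaining user intent and reliability comparable to existing LLM-based methods. In addition, a new dataset labeled with both textual and visual safety signals using an off-the-shelf multimodal LLM is introduced, enabling supervised fine-tuning. Experiments show that the proposed method offers a practical solution for generating safer T2I content without compromising alignment with user intent. The code is available at https://github.com/ku-dmlab/IPR. WARNING: This paper contains examples of harmful or inappropriate images generated by models.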
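
The abstract describes the refinement loop only at a high level, so the following is a minimal sketch of one way such a loop could look, not the authors' implementation (see the linked repository for that). The function name `refine_prompt`, the `Verdict` type, the `max_rounds` budget, and the toy `toy_generate` / `toy_judge` stand-ins for a T2I model and a VLM are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical result of a VLM inspecting a prompt and its image."""
    safe: bool
    revised_prompt: Optional[str] = None  # proposed rewrite when unsafe


def refine_prompt(
    prompt: str,
    generate_image: Callable[[str], Any],
    judge: Callable[[str, Any], Verdict],
    max_rounds: int = 3,  # assumed budget; the abstract gives no number
) -> Tuple[str, Optional[Any]]:
    """Generate, inspect prompt and image together, rewrite, repeat.

    Returns the last prompt tried and a verified-safe image, or None as
    the image if no safe image was found within the budget.
    """
    current = prompt
    for _ in range(max_rounds):
        image = generate_image(current)
        verdict = judge(current, image)
        if verdict.safe:
            # A prompt whose image is already safe is returned unchanged.
            return current, image
        if not verdict.revised_prompt:
            break  # the judge offered no rewrite, so stop early
        current = verdict.revised_prompt
    return current, None


# Toy stand-ins so the sketch runs without any model (assumptions only).
def toy_generate(prompt: str) -> str:
    return f"<image of {prompt}>"


def toy_judge(prompt: str, image: str) -> Verdict:
    if "gore" in image:
        return Verdict(safe=False, revised_prompt=prompt.replace("gore", "drama"))
    return Verdict(safe=True)
```

Because the judge sees the generated image as well as the prompt, a prompt that already yields a safe image passes through untouched, which mirrors the abstract's point about avoiding unnecessary changes to safe prompts.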


Related papers