Fugu-MT Paper Translation (Abstract): Semantic Browsing: Controllable Diversity for Image Generation

Paper summary: Semantic Browsing: Controllable Diversity for Image Generation

arxiv url: http://arxiv.org/abs/2606.23679v1
Date: Mon, 22 Jun 2026 17:59:17 GMT
Status: Translation complete
Last updated in system: 2026-06-24 17:11:31.982487
Title: Semantic Browsing: Controllable Diversity for Image Generation
Authors: Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik, Daniel Cohen-Or
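Abstract summary: We propose a method for controlled diversity that enables Semantic Browsing. We exploit the fact that recent text-to-image models are trained on elaborated captions. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level.
Reference score (independently computed attention score): 51.503726779537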
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.
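To make the idea of inducing diversity at the text level more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the axes of variation are already known and hand-written (in the abstract's setting a VLM-based agentic workflow would propose them), and it only builds the grid of prompts; no image model is called. The function name `build_prompt_gallery`, the axis names, and the prompt template are all hypothetical.

```python
from itertools import product


def build_prompt_gallery(base_prompt, axes):
    """Return one prompt per combination of axis values.

    `axes` maps a hypothetical axis name (e.g. "setting") to its candidate
    values. Each gallery cell records which semantic choice it made on
    every axis, so a variation can be traced back to a named decision.
    """
    names = list(axes)
    gallery = []
    for combo in product(*(axes[name] for name in names)):
        choices = dict(zip(names, combo))
        details = ", ".join(f"{name}: {value}" for name, value in choices.items())
        gallery.append({"choices": choices, "prompt": f"{base_prompt}, {details}"})
    return gallery


# Hypothetical axes, written by hand for illustration only.
axes = {
    "setting": ["forest", "city street"],
    "time of day": ["dawn", "night"],
}
gallery = build_prompt_gallery("a red bicycle", axes)
for cell in gallery:
    print(cell["prompt"])
```

Each prompt in the grid would then be passed to a text-to-image model, so that moving along one axis of the gallery changes exactly one stated semantic choice rather than relying on random seed variation.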


List of related papers