Fugu-MT Paper Translation (Abstract): Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

Paper overview: Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

arxiv url: http://arxiv.org/abs/2605.12271v1
Date: Tue, 12 May 2026 15:35:34 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.969531
Title: Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
Title (reference translation): Beyond Text Prompts: Visual-to-Visual Generation as a Unified Paradigm
Authors: Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, Raymond H. Chan
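Abstract summary: We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. V2V-Zero is a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators. On Simple-V2V Bench, V2V-Zero scores 32.7/100, outperforming the evaluated open-weight image baselines and revealing a clear capability hierarchy.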
Reference score (independently calculated attention level): 27.86374820495554
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.
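Abstract (reference translation): Humans often specify and create through visual artifacts such as typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators: it replaces text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming the evaluated open-weight image baselines and revealing a clear capability hierarchy. A HunyuanVideo-1.5 extension scores 20.2/100, showing that the interface transfers beyond images. Mechanistic analysis shows that the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.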
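
The abstract's closing statistic (95.0% of conditioning-token attention mass on visual-page hidden states) is a ratio of attention weights. The abstract does not say which layers or heads are measured or how they are averaged, so the following is only a minimal sketch of one plausible way to compute such a share. The function name `visual_attention_mass`, the `attn[q][k]` layout, the `is_visual` mask, and the toy numbers are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def visual_attention_mass(
    attn: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> float:
    """Fraction of total attention mass that lands on visual-page tokens.

    attn[q][k] is the non-negative attention weight from query q to
    conditioning token k; is_visual[k] marks conditioning tokens that come
    from the visual page. Both layouts are assumptions for this sketch.
    """
    if any(len(row) != len(is_visual) for row in attn):
        raise ValueError("each attention row must match the mask length")
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    # Sum only the weights whose key position is flagged as visual.
    visual = sum(w for row in attn for w, v in zip(row, is_visual) if v)
    return visual / total


# Toy data (hypothetical): 2 queries attending over 4 conditioning tokens,
# the first three of which are treated as visual-page tokens.
toy_attn = [
    [0.5, 0.3, 0.1, 0.1],
    [0.4, 0.4, 0.1, 0.1],
]
toy_mask = [True, True, True, False]
share = visual_attention_mass(toy_attn, toy_mask)
print(f"visual share: {share:.2f}")
```

In this toy case the visual tokens receive 1.8 of the 2.0 total attention mass, a share of 0.90. A real measurement would also have to decide how to aggregate across heads, layers, and denoising steps, which this sketch leaves open.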
