Fugu-MT Paper Translation (Abstract): RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

Paper overview: RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

arxiv url: http://arxiv.org/abs/2606.17619v1
Date: Tue, 16 Jun 2026 07:25:08 GMT
Status: Translation complete
System update date: 2026-06-17 17:15:32.33077
Title: RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation
Title (reference translation): RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation
Authors: Qiwei Yan, Zhiqiang Yuan, Chongyang Li, Jiapei Zhang, Ying Deng, Jinchao Zhang, Jie Zhou
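Abstract summary: Cross-subject viewpoint alignment is a challenge in reference-driven image generation. We propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA consistently outperforms zero-shot baselines on cross-subject generation.
Reference score (independently computed attention score): 21.751544721133005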
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.
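Abstract (reference translation): Reference-driven image generation has advanced rapidly on identity preservation, but reliable viewpoint control across different subjects is still poorly understood. The model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results suggest that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than from relying on end-to-end generation alone.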
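The abstract describes a two-stage reference selection (viewpoint retrieval, then LogDet-based subset selection) without giving the objective. The sketch below is one plausible reading, not the authors' implementation: it assumes the viewpoint embeddings are already computed, ranks candidates by cosine similarity to the anchor, and then greedily maximizes the log-determinant of a DPP-style kernel that rewards anchor similarity and penalizes redundancy. The names `retrieve_top_m`, `greedy_logdet_select`, `relevance_weight`, and `eps` are hypothetical and do not come from the paper.

```python
import numpy as np


def _unit_rows(x):
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def retrieve_top_m(anchor, candidates, m):
    """Return indices of the m candidates most cosine-similar to the anchor."""
    sims = _unit_rows(candidates) @ (anchor / np.linalg.norm(anchor))
    # Stable sort keeps the original order among exact ties.
    return [int(i) for i in np.argsort(-sims, kind="stable")[:m]]


def greedy_logdet_select(anchor, candidates, k, relevance_weight=1.0, eps=1e-6):
    """Greedily pick up to k candidate indices that maximize log det(L_S).

    Assumed kernel (not from the paper): L = diag(q) K diag(q) + eps * I,
    where K is the cosine-similarity matrix between candidates and
    q_i = exp(relevance_weight * cos(candidate_i, anchor)).
    """
    x = _unit_rows(candidates)
    a = anchor / np.linalg.norm(anchor)
    q = np.exp(relevance_weight * (x @ a))
    kernel = q[:, None] * (x @ x.T) * q[None, :] + eps * np.eye(len(x))

    selected = []
    for _ in range(min(k, len(x))):
        best, best_val = None, -np.inf
        for i in range(len(x)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected


if __name__ == "__main__":
    # Toy embeddings: two identical views, one nearby view, one opposite view.
    anchor = np.array([1.0, 0.0, 0.0])
    cands = np.array([
        [1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.8, 0.6, 0.0],
        [-1.0, 0.0, 0.0],
    ])
    pool = retrieve_top_m(anchor, cands, m=3)
    local = greedy_logdet_select(anchor, cands[pool], k=2)
    print([pool[j] for j in local])
```

In this toy input, the exact duplicate of the first pick adds almost nothing to the determinant, so the second pick is the distinct nearby view. That is the intended effect of a log-determinant objective: a reference set that stays close to the anchor viewpoint while avoiding redundant images.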


Related papers