Fugu-MT Paper Translation (Abstract): ReCap: Lightweight Referential Grounding for Coherent Story Visualization

Paper abstract: ReCap: Lightweight Referential Grounding for Coherent Story Visualization

arxiv url: http://arxiv.org/abs/2604.18575v1
Date: Mon, 20 Apr 2026 17:57:50 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:53.039668
Title: ReCap: Lightweight Referential Grounding for Coherent Story Visualization
Authors: Aditya Arora, Akshita Gupta, Pau Rodriguez, Marcus Rohrbach,
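Abstract summary: ReCap is a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap's CORE (COnditional frame REferencing) module treats anaphors as visual anchors. The authors extend story visualization to human-centric narratives derived from real films, demonstrating the capability of ReCap beyond stylized cartoon domains.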
Reference score (independently computed attention score): 11.022891519635834
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Story Visualization aims to generate a sequence of images that faithfully depicts a textual narrative while preserving character identity, spatial configuration, and stylistic coherence as the narrative unfolds. Maintaining such cross-frame consistency has traditionally relied on explicit memory banks, architectural expansion, or auxiliary language models, resulting in substantial parameter growth and inference overhead. We introduce ReCap, a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap's CORE (COnditional frame REferencing) module treats anaphors, in our case pronouns, as visual anchors, activating only when characters are referred to by a pronoun and conditioning on the preceding frame to propagate visual identity. This selective design avoids unconditional cross-frame conditioning and introduces only 149K additional parameters, a fraction of the cost of memory-bank and LLM-augmented approaches. To further stabilize identity, we incorporate SemDrift (Guided Semantic Drift Correction), applied only during training. When text is vague or referential, the denoiser lacks a visual anchor for identity-defining attributes, causing character appearance to drift across frames. SemDrift corrects this by aligning denoiser representations with pretrained DINOv3 visual embeddings, enforcing semantic identity stability at zero inference cost. ReCap outperforms the previous state of the art, StoryGPT-V, on the two main benchmarks for story visualization, by 2.63% Character-Accuracy on FlintstonesSV and by 5.65% on PororoSV, establishing a new state of the art in character consistency on both benchmarks. Furthermore, we extend story visualization to human-centric narratives derived from real films, demonstrating the capability of ReCap beyond stylized cartoon domains.
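
The selective behaviour the abstract attributes to CORE (condition on the preceding frame only when a caption refers to a character by pronoun) can be illustrated with a small gating sketch. This is not the authors' implementation: the pronoun list, the function names `has_pronoun` and `reference_plan`, and the sample story are all assumptions made for illustration, and the real module operates inside a diffusion model rather than on frame indices.

```python
import re

# Hypothetical pronoun list; the abstract does not say which pronouns are used.
PRONOUNS = {"he", "she", "they", "it", "him", "her", "them",
            "his", "hers", "their", "its"}


def has_pronoun(caption: str) -> bool:
    """Return True if any whole word in the caption is in PRONOUNS."""
    tokens = re.findall(r"[a-z']+", caption.lower())
    return any(tok in PRONOUNS for tok in tokens)


def reference_plan(captions):
    """For each frame, give the index of the frame to condition on, or None.

    A frame is conditioned on its predecessor only when its caption contains
    a pronoun; the first frame has no predecessor and is never conditioned.
    """
    plan = []
    for i, caption in enumerate(captions):
        if i > 0 and has_pronoun(caption):
            plan.append(i - 1)   # pronoun present: reuse the preceding frame
        else:
            plan.append(None)    # explicit name or first frame: no reference
    return plan


# Made-up captions, for illustration only.
story = [
    "Fred is standing in the kitchen.",
    "He picks up a bowl.",
    "Wilma walks into the room.",
    "They talk near the table.",
]
print(reference_plan(story))  # [None, 0, None, 2]
```

Frames whose captions name the character explicitly get no reference, which mirrors the abstract's point that cross-frame conditioning is not applied unconditionally.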

Related papers