Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models
- URL: http://arxiv.org/abs/2407.02482v2
- Date: Wed, 3 Jul 2024 18:17:01 GMT
- Title: Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models
- Authors: Fei Shen, Hu Ye, Sibo Liu, Jun Zhang, Cong Wang, Xiao Han, Wei Yang
- Abstract summary: We propose Rich-contextual Conditional Diffusion Models (RCDMs) to enhance the semantic and temporal consistency of story generation.
Unlike autoregressive models, RCDMs generate consistent stories in a single forward inference.
- Score: 12.907590808274358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research showcases the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which predominantly generate stories autoregressively and depend excessively on captions, often underrate the contextual consistency and relevance of frames during sequential generation. To address this, we propose Rich-contextual Conditional Diffusion Models (RCDMs), a novel two-stage approach designed to enhance the semantic and temporal consistency of story generation. Specifically, in the first stage, a frame-prior transformer diffusion model predicts the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate semantically and temporally consistent stories. Moreover, unlike autoregressive models, RCDMs can generate a consistent story in a single forward inference. Our qualitative and quantitative results demonstrate that RCDMs outperform prior methods in challenging scenarios. The code and model will be available at https://github.com/muzishen/RCDMs.
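To make the two-stage design described in the abstract easier to picture, here is a minimal structural sketch in PyTorch. All class names, dimensions, and the placeholder convolutional backbone are assumptions made for illustration only; the authors' actual implementation is in the linked repository.

```python
# Minimal structural sketch of the two-stage RCDMs pipeline described in the
# abstract. Names, shapes, and hyperparameters are illustrative assumptions,
# not the released implementation (see https://github.com/muzishen/RCDMs).
import torch
import torch.nn as nn


class FramePriorTransformer(nn.Module):
    """Stage 1 (assumed structure): predict the frame semantic embeddings of the
    unknown clip from caption embeddings and the known clip's frame embeddings."""

    def __init__(self, dim=768, n_layers=6, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, caption_emb, known_frame_emb, noisy_unknown_emb):
        # One denoising step of the prior diffusion model: concatenate context
        # tokens with the noisy target embeddings and process them jointly.
        tokens = torch.cat([caption_emb, known_frame_emb, noisy_unknown_emb], dim=1)
        out = self.encoder(tokens)
        # Keep only the positions corresponding to the unknown frames.
        return out[:, -noisy_unknown_emb.size(1):]


class RichContextualDenoiser(nn.Module):
    """Stage 2 (assumed structure): denoise frame latents conditioned on reference
    images (image level) and semantic/text embeddings (feature level)."""

    def __init__(self, latent_dim=4, cond_dim=768):
        super().__init__()
        # Placeholder backbone; the paper uses a diffusion U-Net.
        self.backbone = nn.Conv2d(latent_dim * 2, latent_dim, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, latent_dim)

    def forward(self, noisy_latents, ref_latents, frame_emb, caption_emb):
        # Image-level injection: channel-wise concatenation with reference latents.
        x = self.backbone(torch.cat([noisy_latents, ref_latents], dim=1))
        # Feature-level injection: add pooled, projected semantic/text conditions.
        cond = self.cond_proj(torch.cat([frame_emb, caption_emb], dim=1)).mean(dim=1)
        return x + cond[:, :, None, None]
```

In practice each forward call above would sit inside a standard diffusion sampling loop; the relevant point from the abstract is that all unknown frames are denoised jointly, so the story is produced in one forward pass rather than frame-by-frame autoregression.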
Related papers
- Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration [53.63593099509471]
We propose a scheduler-exploiter S2S-Diffusion paradigm designed to overcome the limitations of existing S2S-Diffusion models.
We employ Meta-Exploration to train an additional scheduler model dedicated to scheduling contextualized noise for each sentence.
Our exploiter model, an S2S-Diffusion model, leverages the noise scheduled by our scheduler model for updating and generation.
arXiv Detail & Related papers (2024-10-17T04:06:02Z)
- NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative [19.79736018383692]
Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative.
We propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders for capturing cause and effect dynamics independently.
arXiv Detail & Related papers (2024-06-10T17:34:24Z)
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [117.13475564834458]
We propose a new way of computing self-attention, termed Consistent Self-Attention; a minimal illustrative sketch of this idea appears after this list.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
arXiv Detail & Related papers (2024-05-02T16:25:16Z)
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance among diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis [12.766712398098646]
We propose Causal-Story, which considers the causal relationship between previous captions, frames, and current captions.
We evaluate our model on the PororoSV and FlintstonesSV datasets and obtain state-of-the-art FID scores.
arXiv Detail & Related papers (2023-09-18T08:06:06Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
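As a note on the Consistent Self-Attention idea mentioned in the StoryDiffusion entry above, here is a minimal sketch of how tokens from other frames in a batch can be folded into the self-attention keys and values to tie character appearance together across frames. The sampling ratio, tensor shapes, and the omission of the usual Q/K/V projections are simplifying assumptions made for this sketch, not the paper's exact formulation.

```python
# Illustrative sketch: each frame attends to its own tokens plus tokens
# randomly sampled from the other frames in the batch, encouraging
# consistent subjects across the generated story. Shapes and the sampling
# ratio are assumptions; this is not StoryDiffusion's official code.
import torch
import torch.nn.functional as F


def consistent_self_attention(x, num_heads=8, sample_ratio=0.5):
    """x: (num_frames, tokens, channels) hidden states from one attention layer."""
    b, n, c = x.shape
    head_dim = c // num_heads

    # Sample a subset of tokens from every frame and share them across the batch.
    k_sample = int(n * sample_ratio)
    idx = torch.randperm(n)[:k_sample]
    shared = x[:, idx, :].reshape(1, b * k_sample, c).expand(b, -1, -1)

    q = x                                   # queries: each frame's own tokens
    kv = torch.cat([x, shared], dim=1)      # keys/values: own tokens + shared tokens

    def split_heads(t):
        return t.reshape(b, -1, num_heads, head_dim).transpose(1, 2)

    out = F.scaled_dot_product_attention(split_heads(q), split_heads(kv), split_heads(kv))
    return out.transpose(1, 2).reshape(b, n, c)
```

Because the change only alters which tokens the attention layer sees, a mechanism of this kind can be applied to a pretrained diffusion backbone; consult the StoryDiffusion paper for the exact formulation and training setup.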