ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
- URL: http://arxiv.org/abs/2602.10113v1
- Date: Tue, 10 Feb 2026 18:59:51 GMT
- Title: ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
- Authors: Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati, Hansi Wu, Binbin Li, Zhengzhong Tu,
- Abstract summary: Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions.<n>Existing I2V pipelines often suffer from appearance drift and geometric distortion.<n>We propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views.
- Score: 14.141157176094737
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.
Related papers
- MV-S2V: Multi-View Subject-Consistent Video Generation [14.479120381560621]
We propose and address the challenging Multi-View S2V (MV-S2V) task.<n>MV-S2V synthesizes videos from multiple reference views to enforce 3D-level subject consistency.<n>Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs.
arXiv Detail & Related papers (2026-01-25T09:02:33Z) - MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation [59.75554954111619]
We introduce Multi-view 3D Referring Expression (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images.<n>We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning.<n>Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives.
arXiv Detail & Related papers (2026-01-11T11:44:07Z) - DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer [21.788582116033684]
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video.<n>Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency.<n>We propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping to the video domain.
arXiv Detail & Related papers (2026-01-04T08:07:11Z) - CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework.<n>We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic)<n>Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z) - BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios.<n>We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities.<n>Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation [22.1310564466224]
Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions.<n>Existing models often struggle to maintain strong semantic consistency.<n>We propose SSG-DiT, a novel and efficient framework for high-fidelity controllable video generation.
arXiv Detail & Related papers (2025-08-23T15:30:17Z) - Phantom: Subject-consistent video generation via cross-modal alignment [16.777805813950486]
We propose a unified video generation framework for both single- and multi-subject references.<n>The proposed method achieves high-fidelity subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion.
arXiv Detail & Related papers (2025-02-16T11:02:50Z) - InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z) - E2HQV: High-Quality Video Generation from Event Camera via
Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) in high temporal resolution and high dynamic range.
It calls for events-to-video (E2V) solutions which take event-streams as input and generate high quality video frames for intuitive visualization.
We propose textbfE2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated
by AI [1.1035305628305816]
This paper introduces AIGCBench, a pioneering comprehensive benchmark designed to evaluate a variety of video generation tasks.
A varied and open-domain image-text dataset that evaluates different state-of-the-art algorithms under equivalent conditions.
We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced Text-to-Image models.
arXiv Detail & Related papers (2024-01-03T10:08:40Z) - MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed as MagDiff, for both tasks of high-fidelity video generation and editing.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z) - Towards Coding for Human and Machine Vision: A Scalable Image Coding
Approach [104.02201472370801]
We come up with a novel image coding framework by leveraging both the compressive and the generative models.
By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels.
Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection.
arXiv Detail & Related papers (2020-01-09T10:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.