MV-S2V: Multi-View Subject-Consistent Video Generation
- URL: http://arxiv.org/abs/2601.17756v2
- Date: Tue, 27 Jan 2026 13:24:40 GMT
- Title: MV-S2V: Multi-View Subject-Consistent Video Generation
- Authors: Ziyang Song, Xinyu Gong, Bangya Liu, Zelin Zhao,
- Abstract summary: We propose and address the challenging Multi-View S2V (MV-S2V) task. MV-S2V synthesizes videos from multiple reference views to enforce 3D-level subject consistency. Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs.
- Score: 14.479120381560621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation reduces the S2V task to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. To address the scarcity of training data, we first develop a synthetic data curation pipeline that generates highly customized synthetic data, complemented by a small-scale real-world captured dataset to boost the training of MV-S2V. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS-RoPE) to distinguish between different subjects and between distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs, establishing a meaningful new direction for subject-driven video generation. Our project page is available at: https://szy-young.github.io/mv-s2v
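The abstract only states the goal of TS-RoPE: give reference tokens temporal positions that separate different subjects and different views of the same subject. The sketch below is a minimal, hedged illustration of one plausible reading of that idea, not the authors' implementation; the function names, negative-offset scheme, and `subject_stride`/`view_stride` values are assumptions introduced here for clarity.

```python
# Hedged sketch of a "temporally shifted" RoPE indexing scheme (assumed, not
# taken from the paper): generated video frames keep positions 0..T-1, while
# each reference-view token is placed at a distinct shifted temporal position
# determined by its subject id and view id.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary angles for integer positions; returns shape (*pos, dim/2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[..., None] * inv_freq

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs of x (..., dim) by the given angles (..., dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def ts_rope_positions(num_video_frames: int, subject_ids, view_ids,
                      subject_stride: int = 64, view_stride: int = 8):
    """Assumed indexing: video frames use 0..T-1; each reference token gets a
    negative temporal offset that depends on its subject and view indices, so
    cross-subject and cross-view references never share a position."""
    video_pos = torch.arange(num_video_frames)
    ref_pos = torch.tensor([
        -(1 + s * subject_stride + v * view_stride)
        for s, v in zip(subject_ids, view_ids)
    ])
    return video_pos, ref_pos

# Example: 16 video frames, two subjects with 3 and 2 reference views.
video_pos, ref_pos = ts_rope_positions(16, subject_ids=[0, 0, 0, 1, 1],
                                        view_ids=[0, 1, 2, 0, 1])
dim = 64
q_refs = torch.randn(5, dim)                      # one token per reference view
q_refs = apply_rope(q_refs, rope_angles(ref_pos, dim))
```

Under this assumed scheme, views of the same subject sit close together in position space while different subjects are pushed far apart, which is one way the conditioning could keep cross-subject and cross-view references from being confused.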
Related papers
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation [14.141157176094737]
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions. Existing I2V pipelines often suffer from appearance drift and geometric distortion. We propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views.
arXiv Detail & Related papers (2026-02-10T18:59:51Z) - Scaling Zero-Shot Reference-to-Video Generation [45.15099584926898]
We introduce Saber, a scalable zero-shot framework that requires no explicit R2V data. Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. It achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
arXiv Detail & Related papers (2025-12-07T16:10:25Z) - BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - LatentMove: Towards Complex Human Movement Video Generation [35.83863053692456]
We present LatentMove, a DiT-based framework specifically tailored for highly dynamic human animation. Our architecture incorporates a conditional control branch and learnable face/body tokens to preserve consistency as well as fine-grained details across frames. We introduce Complex-Human-Videos (CHV), a dataset featuring diverse, challenging human motions designed to benchmark the robustness of I2V systems.
arXiv Detail & Related papers (2025-05-28T07:10:49Z) - SkyReels-A2: Compose Anything in Video Diffusion Transformers [27.324119455991926]
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements into synthesized videos. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. We propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment.
arXiv Detail & Related papers (2025-04-03T09:50:50Z) - InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z) - Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z) - VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
arXiv Detail & Related papers (2024-06-06T17:25:33Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - Make It Move: Controllable Image-to-Video Generation with Text Descriptions [69.52360725356601]
The TI2V task aims at generating videos from a static image and a text description.
To address these challenges, we propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor structure.
Experiments verify the effectiveness of MAGE and demonstrate the appealing potential of the TI2V task.
arXiv Detail & Related papers (2021-12-06T07:00:36Z)