Scaling Zero-Shot Reference-to-Video Generation
- URL: http://arxiv.org/abs/2512.06905v1
- Date: Sun, 07 Dec 2025 16:10:25 GMT
- Title: Scaling Zero-Shot Reference-to-Video Generation
- Authors: Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
- Abstract summary: We introduce Saber, a scalable zero-shot framework that requires no explicit R2V data. Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. It achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
- Score: 45.15099584926898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
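The abstract describes the masked training strategy and mask augmentation only at a high level, so the snippet below is a minimal, hypothetical sketch of how training examples could be built from plain video-text pairs: a subject is cut out of a sampled frame with a segmentation mask to act as a pseudo reference, and that pseudo reference is augmented so the model cannot simply copy-paste it into the generated frames. The function names, parameters, and use of PyTorch/torchvision are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical data-construction sketch for zero-shot R2V training (not from the paper).
import torch
import torchvision.transforms as T

def make_pseudo_reference(frame: torch.Tensor, subject_mask: torch.Tensor) -> torch.Tensor:
    """Cut the subject out of a training frame to serve as a pseudo reference image.

    frame:        (3, H, W) float tensor in [0, 1]
    subject_mask: (1, H, W) binary tensor, 1 = subject pixels (e.g., from an off-the-shelf segmenter)
    """
    return frame * subject_mask  # background zeroed out, subject kept

# Reference augmentations meant to break pixel-level alignment between the pseudo
# reference and the target frames, discouraging copy-paste artifacts.
# The exact transform list is an assumption.
reference_augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.RandomHorizontalFlip(p=0.5),
])

def build_training_example(video: torch.Tensor, masks: torch.Tensor):
    """Turn an ordinary video-text clip into a (target clip, pseudo reference) pair.

    video: (T, 3, H, W) clip sampled from a video-text pair
    masks: (T, 1, H, W) per-frame subject masks
    """
    t = torch.randint(0, video.shape[0], (1,)).item()      # pick a random frame
    reference = make_pseudo_reference(video[t], masks[t])   # subject-only image
    reference = reference_augment(reference)                # de-correlate from the target
    return video, reference

if __name__ == "__main__":
    clip = torch.rand(16, 3, 256, 256)                            # dummy clip
    subject_masks = (torch.rand(16, 1, 256, 256) > 0.5).float()   # dummy masks
    target, ref = build_training_example(clip, subject_masks)
    print(target.shape, ref.shape)  # torch.Size([16, 3, 256, 256]) torch.Size([3, 256, 256])
```

In a full pipeline, the augmented pseudo reference would condition the video model (e.g., through its attention layers, in line with the paper's attention-based design), while the original clip and caption remain the only supervision, which is what keeps the setup free of explicit R2V triplets.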
Related papers
- MV-S2V: Multi-View Subject-Consistent Video Generation [14.479120381560621]
We propose and address the challenging Multi-View S2V (MV-S2V) task. MV-S2V synthesizes videos from multiple reference views to enforce 3D-level subject consistency. Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs.
arXiv Detail & Related papers (2026-01-25T09:02:33Z)
- ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation [36.29956463871403]
Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information.
arXiv Detail & Related papers (2025-12-08T09:12:18Z)
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
- RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation [19.127189099122244]
We introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single step. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states.
arXiv Detail & Related papers (2025-11-06T12:42:03Z)
- TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address the reference identity confusion problem. Instruct Token Injection acts as an extra visual feature container, injecting detailed and complementary priors for the reference tokens. The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z)
- Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement [58.85593321752693]
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. We introduce a Training-Free Prompt, Image, and Guidance Enhancement framework that bridges the semantic gap between the video description and the reference image. We win first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.
arXiv Detail & Related papers (2025-09-01T11:03:13Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [77.59074909960913]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA). We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. To coordinate these otherwise disconnected models, we propose using learnable tokens as a communication medium among the four frozen models GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", built with dedicated components on top of a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces DirecT2V, a new framework for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- Reference-based Image and Video Super-Resolution via C2-Matching [100.0808130445653]
We propose C2-Matching, which performs explicit, robust matching across transformation and resolution gaps.
C2-Matching significantly outperforms the state of the art on the standard CUFED5 benchmark.
We also extend C2-Matching to the Reference-based Video Super-Resolution task, where an image taken in a similar scene serves as the HR reference image.
arXiv Detail & Related papers (2022-12-19T16:15:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.