Ingredients: Blending Custom Photos with Video Diffusion Transformers
- URL: http://arxiv.org/abs/2501.01790v1
- Date: Fri, 03 Jan 2025 12:45:22 GMT
- Title: Ingredients: Blending Custom Photos with Video Diffusion Transformers
- Authors: Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
- Abstract summary: \texttt{Ingredients} is a framework for customizing video creations by incorporating multiple specific identity (ID) photos.
Our method consists of three primary modules, including (\textbf{i}) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives.
\texttt{Ingredients} demonstrates superior performance in turning custom photos into dynamic and personalized video content.
- Score: 31.736838809714726
- Abstract: This paper presents a powerful framework to customize video creations by incorporating multiple specific identity (ID) photos, with video diffusion Transformers, referred to as \texttt{Ingredients}. Generally, our method consists of three primary modules: (\textbf{i}) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (\textbf{ii}) a multi-scale projector that maps face embeddings into the contextual space of the image query in video diffusion transformers; (\textbf{iii}) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, \texttt{Ingredients} demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing approaches, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: \url{https://github.com/feizc/Ingredients}.
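To make the three-module design above concrete, here is a minimal PyTorch sketch of how a projector and ID router could wire face embeddings into a video DiT's token stream. All class names, dimensions, and the routing rule are illustrative assumptions, not the released implementation (see the repository linked above for the actual code).

```python
# Illustrative sketch only: names, shapes, and wiring are assumptions.
import torch
import torch.nn as nn


class FaceProjector(nn.Module):
    """Stand-in for the paper's multi-scale projector: maps per-ID face
    embeddings into the video DiT's query space (here a plain MLP)."""
    def __init__(self, face_dim: int = 512, query_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, query_dim),
            nn.GELU(),
            nn.Linear(query_dim, query_dim),
        )

    def forward(self, face_emb: torch.Tensor) -> torch.Tensor:
        # face_emb: (num_ids, face_dim) -> (num_ids, query_dim)
        return self.proj(face_emb)


class IDRouter(nn.Module):
    """Softly assigns each space-time token to one of the ID embeddings."""
    def __init__(self, query_dim: int = 1024):
        super().__init__()
        self.scale = query_dim ** -0.5

    def forward(self, tokens: torch.Tensor, id_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, query_dim) -- flattened space-time latents
        # id_emb: (num_ids, query_dim)
        logits = tokens @ id_emb.t() * self.scale   # (B, T, num_ids)
        routing = logits.softmax(dim=-1)            # per-token ID weights
        # Inject a weighted mix of ID embeddings into each token's region.
        return tokens + routing @ id_emb            # (B, T, query_dim)


# Usage: two reference IDs routed into 2048 space-time tokens.
face_emb = torch.randn(2, 512)        # stand-in for the facial extractor's output
tokens = torch.randn(1, 2048, 1024)   # stand-in for DiT latents
projector, router = FaceProjector(), IDRouter()
conditioned = router(tokens, projector(face_emb))
print(conditioned.shape)  # torch.Size([1, 2048, 1024])
```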
Related papers
- Phantom: Subject-consistent video generation via cross-modal alignment [13.067225653349901]
Phantom is a unified video generation framework for both single and multi-subject references.
The framework emphasizes subject consistency in human generation, covering existing ID-preserving video generation while offering additional advantages.
arXiv Detail & Related papers (2025-02-16T11:02:50Z)
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motion and fine-grained object appearance (a toy blob parameterization is sketched after this entry).
We show that our framework is model-agnostic and build BlobGEN-Vid on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
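As a rough illustration of blob-grounded control, the following sketch parameterizes an object as a per-frame ellipse and rasterizes it into a conditioning mask; the parameterization and function names are assumptions for exposition, not BlobGEN-Vid's actual representation.

```python
# Toy blob layout control: each object is a per-frame ellipse
# (center, radii, angle) rasterized into a mask a grounded model
# could condition on. All names here are illustrative assumptions.
import numpy as np


def blob_mask(h, w, cx, cy, rx, ry, theta):
    """Rasterize one elliptical blob into an (h, w) binary mask."""
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = xs - cx, ys - cy
    # Rotate coordinates into the blob's frame, then test the ellipse equation.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return ((xr / rx) ** 2 + (yr / ry) ** 2 <= 1.0).astype(np.float32)


# A blob drifting rightward over 16 frames encodes coarse object motion.
frames = [blob_mask(64, 64, cx=16 + t * 2, cy=32, rx=10, ry=6, theta=0.3)
          for t in range(16)]
print(np.stack(frames).shape)  # (16, 64, 64)
```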
- Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance [29.768141136041454]
We propose a novel multi-character video generation framework based on separated text and pose guidance.
Specifically, we first extract character masks from the pose sequence to identify the spatial position of each generated character (see the sketch after this entry), and then obtain a single prompt for each character with LLMs.
Visualized results demonstrate the precise controllability of our method for multi-character generation.
arXiv Detail & Related papers (2024-12-21T05:49:40Z)
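The following toy sketch illustrates the mask-extraction step mentioned above: a per-character spatial mask is derived from 2D pose keypoints via a padded bounding box. The keypoint format and the bounding-box heuristic are assumptions, not the paper's exact procedure.

```python
# Illustrative only: derive a per-character region mask from pose keypoints.
import numpy as np


def character_mask(keypoints, h, w, pad=8):
    """Bounding-box mask around one character's 2D pose keypoints."""
    xs, ys = keypoints[:, 0], keypoints[:, 1]
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad, w)
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad, h)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask


# Two characters' keypoints yield two disjoint spatial regions, each of
# which can then be paired with its own per-character prompt.
left = np.array([[30, 40], [50, 120], [40, 200]], dtype=np.float32)
right = np.array([[180, 50], [200, 130], [190, 210]], dtype=np.float32)
masks = [character_mask(k, h=256, w=256) for k in (left, right)]
```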
- VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation [70.61101071902596]
Current generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos.
We propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation.
Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer.
It can generate 10-second continuous videos (160 frames at 16 fps) aligned with a text prompt, at a resolution of 768×1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
- GenDeF: Learning Generative Deformation Field for Video Generation [89.49567113452396]
We propose to render a video by warping one static image with a generative deformation field (GenDeF); a minimal warping sketch follows this entry.
Such a pipeline enjoys three appealing advantages.
arXiv Detail & Related papers (2023-12-07T18:59:41Z)
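To illustrate the warping idea behind GenDeF, here is a minimal PyTorch sketch that backward-warps one static image with a per-frame deformation field via grid_sample. In the paper the field is learned generatively; here it is random noise purely for demonstration.

```python
# Backward-warping one image with a deformation field (illustrative).
import torch
import torch.nn.functional as F


def warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp image (1, C, H, W) with flow (1, H, W, 2) given in pixels."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float() + flow[0]  # sample positions
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(image, grid.unsqueeze(0), align_corners=True)


# One static image warped by 8 deformation fields yields 8 video frames.
image = torch.rand(1, 3, 64, 64)
frames = [warp(image, torch.randn(1, 64, 64, 2) * 2) for _ in range(8)]
```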
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning performs text-image matching by mapping images and text into a common embedding space (a toy version is sketched after this entry).
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
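As a rough sketch of the visual-linguistic similarity component described above, the snippet below projects image and text features into a common embedding space and scores matches by cosine similarity; the encoders, dimensions, and class name are placeholders rather than TediGAN's implementation.

```python
# Toy common-embedding-space matching (placeholders, not TediGAN's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonSpace(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)

    def similarity(self, img_feat, txt_feat):
        # Cosine similarity between L2-normalized projections.
        zi = F.normalize(self.img_proj(img_feat), dim=-1)
        zt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return zi @ zt.t()  # (num_images, num_texts) match scores


space = CommonSpace()
scores = space.similarity(torch.randn(4, 512), torch.randn(4, 300))
print(scores.shape)  # torch.Size([4, 4])
```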