Ingredients: Blending Custom Photos with Video Diffusion Transformers
- URL: http://arxiv.org/abs/2501.01790v2
- Date: Tue, 18 Mar 2025 10:47:27 GMT
- Title: Ingredients: Blending Custom Photos with Video Diffusion Transformers
- Authors: Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan,
- Abstract summary: Ingredients is a framework to customize video creations by incorporating multiple specific identity (ID) photos. It consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image queries in video diffusion transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions.
- Score: 31.736838809714726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a powerful framework, referred to as Ingredients, that customizes video creations by incorporating multiple specific identity (ID) photos with video diffusion Transformers. Generally, our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image queries in video diffusion transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures, compared to existing methods. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients.
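The abstract describes a three-module pipeline: a facial extractor, a multi-scale projector, and an ID router. The PyTorch-style sketch below is a minimal illustration, based only on the abstract, of how such modules might compose. The module names, dimensions, the dot-product routing, and the assumption that the facial extractor supplies precomputed global/local embeddings are all illustrative choices, not the released implementation (see the GitHub repository above for the actual code).

```python
# Hypothetical sketch of the projector + router described in the abstract.
# Dimensions, module internals, and names are illustrative assumptions;
# the official implementation lives at https://github.com/feizc/Ingredients.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleProjector(nn.Module):
    """Maps global + local face embeddings into the DiT token space."""

    def __init__(self, face_dim=512, hidden_dim=1024, token_dim=3072, num_tokens=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2 * face_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_tokens * token_dim),
        )
        self.num_tokens = num_tokens
        self.token_dim = token_dim

    def forward(self, global_feat, local_feat):
        # global_feat, local_feat: (num_ids, face_dim) from a facial extractor
        # (e.g. a CLIP-style encoder for global context plus a face-recognition
        # backbone for identity detail -- both assumptions here).
        fused = torch.cat([global_feat, local_feat], dim=-1)
        tokens = self.proj(fused)                                 # (num_ids, num_tokens * token_dim)
        return tokens.view(-1, self.num_tokens, self.token_dim)   # (num_ids, num_tokens, token_dim)


class IDRouter(nn.Module):
    """Assigns each ID's tokens to space-time regions of the video latent."""

    def __init__(self, token_dim=3072):
        super().__init__()
        self.query = nn.Linear(token_dim, token_dim)
        self.key = nn.Linear(token_dim, token_dim)

    def forward(self, latent_tokens, id_tokens):
        # latent_tokens: (batch, num_latent, token_dim) space-time patches
        # id_tokens:     (num_ids, num_tokens, token_dim) projected face tokens
        id_summary = id_tokens.mean(dim=1)                              # (num_ids, token_dim)
        logits = self.query(latent_tokens) @ self.key(id_summary).t()   # (batch, num_latent, num_ids)
        routing = F.softmax(logits, dim=-1)                             # soft assignment per latent patch
        injected = routing @ id_summary                                 # (batch, num_latent, token_dim)
        return latent_tokens + injected                                 # ID-conditioned latent tokens


# Usage: route two identities into a toy latent token sequence.
projector, router = MultiScaleProjector(), IDRouter()
global_feat, local_feat = torch.randn(2, 512), torch.randn(2, 512)
latents = torch.randn(1, 226, 3072)
conditioned = router(latents, projector(global_feat, local_feat))
print(conditioned.shape)  # torch.Size([1, 226, 3072])
```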
Related papers
- Reangle-A-Video: 4D Video Generation as Video-to-Video Translation [51.328567400947435]
We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video.
Our method reframes the multi-view video generation task as video-to-video translation, leveraging publicly available image and video diffusion priors.
arXiv Detail & Related papers (2025-03-12T08:26:15Z)
- Get In Video: Add Anything You Want to the Video [48.06070610416688]
Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage.
Current approaches fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions.
We introduce "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos.
arXiv Detail & Related papers (2025-03-08T16:27:53Z)
- Phantom: Subject-consistent video generation via cross-modal alignment [16.777805813950486]
We propose a unified video generation framework for both single- and multi-subject references.
The proposed method achieves high-fidelity subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion.
arXiv Detail & Related papers (2025-02-16T11:02:50Z)
- CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers [15.558659099600822]
CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features.
We propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features.
Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
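The key point in this entry is that the pre-trained video network stays frozen and only LoRA parameters are trained to absorb the reference features. The sketch below shows that generic "train only the LoRA parameters" idea; the rank, scaling, and the LoRALinear wrapper are illustrative assumptions, not CustomVideoX's actual code.

```python
# Generic LoRA sketch: freeze a pre-trained linear layer and train only the
# low-rank adapters. Rank, scaling, and placement are illustrative assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the pre-trained weights stay frozen
            p.requires_grad = False
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)   # adapters start as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))


# Usage: only the adapter weights are handed to the optimizer.
layer = LoRALinear(nn.Linear(3072, 3072))
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
out = layer(torch.randn(1, 226, 3072))
print(out.shape, sum(p.numel() for p in trainable))
```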
arXiv Detail & Related papers (2025-02-10T14:50:32Z)
- Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance [29.768141136041454]
We propose a novel multi-character video generation framework based on separated text and pose guidance.
Specifically, we first extract character masks from the pose sequence to identify the spatial position of each character to be generated, and then individual prompts for each character are obtained with LLMs.
The visualized results of the generated videos demonstrate the precise controllability of our method for multi-character generation.
arXiv Detail & Related papers (2024-12-21T05:49:40Z)
- VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation [70.61101071902596]
Current generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos.
We propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation.
Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation [67.97044071594257]
TweedieMix is a novel method for composing customized diffusion models.
Our framework can be effortlessly extended to image-to-video diffusion models.
arXiv Detail & Related papers (2024-10-08T01:06:01Z)
- SITAR: Semi-supervised Image Transformer for Action Recognition [20.609596080624662]
This paper addresses video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos.
We capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images.
Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition.
arXiv Detail & Related papers (2024-09-04T17:49:54Z)
- VideoStudio: Generating Consistent-Content and Multi-Scene Videos [88.88118783892779]
VideoStudio is a framework for consistent-content and multi-scene video generation.
VideoStudio leverages Large Language Models (LLMs) to convert the input prompt into a comprehensive multi-scene script.
VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.
arXiv Detail & Related papers (2024-01-02T15:56:48Z)
- GenDeF: Learning Generative Deformation Field for Video Generation [89.49567113452396]
We propose to render a video by warping one static image with a generative deformation field (GenDeF).
Such a pipeline enjoys three appealing advantages.
arXiv Detail & Related papers (2023-12-07T18:59:41Z)
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [86.25225894085105]
CoDeF is a new type of video representation consisting of a canonical content field and a temporal deformation field.
We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training.
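The canonical-content-field plus temporal-deformation-field factorization can be illustrated with two coordinate MLPs: a pixel at (x, y) in frame t is rendered by first querying the deformation field and then the content field. The sketch below follows that idea only; the network sizes, the absence of positional encoding, and the helper names are simplifying assumptions, not the paper's implementation.

```python
# Illustrative factorization of a video into a canonical content field C and a
# temporal deformation field D: a pixel at (x, y) in frame t is rendered as
# C((x, y) + D(x, y, t)). MLP sizes and normalization are assumptions.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256, depth=4):
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


deformation_field = mlp(3, 2)   # (x, y, t) -> offset into canonical space
content_field = mlp(2, 3)       # canonical (x, y) -> RGB


def render_frame(t, height=64, width=64):
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, height), torch.linspace(0, 1, width), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)             # (H*W, 2)
    ts = torch.full((coords.shape[0], 1), float(t))
    canonical_xy = coords + deformation_field(torch.cat([coords, ts], dim=-1))
    rgb = content_field(canonical_xy)                                 # (H*W, 3)
    return rgb.reshape(height, width, 3)


# Training would fit both fields to reconstruct the input video; afterwards an
# image operator applied once to the canonical image propagates to every frame,
# which is how image processing is "lifted" to video processing.
frame = render_frame(t=0.5)
print(frame.shape)  # torch.Size([64, 64, 3])
```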
arXiv Detail & Related papers (2023-08-15T17:59:56Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.