HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
- URL: http://arxiv.org/abs/2505.04512v2
- Date: Thu, 08 May 2025 08:29:00 GMT
- Title: HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
- Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu,
- Abstract summary: HunyuanCustom is a customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation.
- Score: 10.037480577373161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
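The two condition-injection ideas that the abstract describes most concretely, reinforcing identity by concatenating the reference-image latent along the temporal axis and aligning audio via spatial cross-attention, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under our own assumptions about tensor shapes and module boundaries; the class names `IdentityTemporalConcat` and `AudioCrossAttention` are hypothetical and do not correspond to the released HunyuanCustom code.

```python
# Minimal sketch (not the official HunyuanCustom implementation) of two mechanisms
# the abstract describes: (1) reinforcing identity by concatenating a reference-image
# latent along the temporal axis of the video latent, and (2) injecting audio features
# via per-frame spatial cross-attention. Shapes and module names are assumptions.
import torch
import torch.nn as nn


class IdentityTemporalConcat(nn.Module):
    """Prepend a projected reference-image latent as an extra 'frame' so that
    temporal attention in the backbone can propagate identity features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_latent: torch.Tensor, image_latent: torch.Tensor) -> torch.Tensor:
        # video_latent: (B, T, N, D) -- frames x spatial tokens x channels
        # image_latent: (B, N, D)    -- reference-image tokens
        id_frame = self.proj(image_latent).unsqueeze(1)     # (B, 1, N, D)
        return torch.cat([id_frame, video_latent], dim=1)   # (B, T+1, N, D)


class AudioCrossAttention(nn.Module):
    """Spatial cross-attention from video tokens (queries) to audio features
    (keys/values), applied frame by frame with a residual connection."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_latent: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_latent: (B, T, N, D); audio_feats: (B, T, A, D) -- A audio tokens per frame
        B, T, N, D = video_latent.shape
        q = self.norm(video_latent).reshape(B * T, N, D)
        kv = audio_feats.reshape(B * T, -1, D)
        out, _ = self.attn(q, kv, kv)
        return video_latent + out.reshape(B, T, N, D)       # residual injection


if __name__ == "__main__":
    B, T, N, A, D = 1, 8, 64, 16, 128
    video = torch.randn(B, T, N, D)
    image = torch.randn(B, N, D)
    audio = torch.randn(B, T, A, D)
    video = IdentityTemporalConcat(D)(video, image)         # (1, 9, 64, 128)
    audio = torch.cat([audio[:, :1], audio], dim=1)         # pad audio for the extra ID frame
    video = AudioCrossAttention(D)(video, audio)
    print(video.shape)  # torch.Size([1, 9, 64, 128])
```

In the full model, the reference latent would presumably come from the video VAE encoding of the reference image and the audio tokens from a pretrained audio encoder, with such injections repeated hierarchically across transformer blocks; the sketch only shows how the temporal concatenation and the per-frame cross-attention would be wired.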
Related papers
- PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement [26.89021788485701]
PolyVivid is a multi-subject video customization framework that enables flexible and identity-consistent generation.
In experiments, PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
arXiv Detail & Related papers (2025-06-09T15:11:09Z)
- MAGREF: Masked Guidance for Any-Reference Video Generation [33.35245169242822]
MAGREF is a unified framework for any-reference video generation.
We propose a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference.
Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios.
arXiv Detail & Related papers (2025-05-29T17:58:15Z)
- CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance [34.345125922868]
We propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM).
Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort.
Our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation.
arXiv Detail & Related papers (2025-03-13T14:07:58Z)
- Phantom: Subject-consistent video generation via cross-modal alignment [16.777805813950486]
We propose a unified video generation framework for both single- and multi-subject references.
The proposed method achieves high-fidelity subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion.
arXiv Detail & Related papers (2025-02-16T11:02:50Z)
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [117.13475564834458]
We propose a new way of self-attention calculation, termed Consistent Self-Attention.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
arXiv Detail & Related papers (2024-05-02T16:25:16Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z)
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing tasks.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z)
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models [43.46536102838717]
VideoDreamer is a novel framework for customized multi-subject text-to-video generation.
It can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
arXiv Detail & Related papers (2023-11-02T04:38:50Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.