HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
- URL: http://arxiv.org/abs/2505.04512v2
- Date: Thu, 08 May 2025 08:29:00 GMT
- Title: HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
- Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu,
- Abstract summary: HunyuanCustom is a customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation.
- Score: 10.037480577373161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
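The two condition-injection ideas that the abstract describes most concretely, reinforcing identity by concatenating the reference-image latent along the temporal axis and aligning audio via spatial cross-attention, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under our own assumptions about tensor shapes and module boundaries; the class names `IdentityTemporalConcat` and `AudioCrossAttention` are hypothetical and do not correspond to the released HunyuanCustom code.

```python
# Minimal sketch (not the official HunyuanCustom implementation) of two mechanisms
# the abstract describes: (1) reinforcing identity by concatenating a reference-image
# latent along the temporal axis of the video latent, and (2) injecting audio features
# via per-frame spatial cross-attention. Shapes and module names are assumptions.
import torch
import torch.nn as nn


class IdentityTemporalConcat(nn.Module):
    """Prepend a projected reference-image latent as an extra 'frame' so that
    temporal attention in the backbone can propagate identity features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_latent: torch.Tensor, image_latent: torch.Tensor) -> torch.Tensor:
        # video_latent: (B, T, N, D) -- frames x spatial tokens x channels
        # image_latent: (B, N, D)    -- reference-image tokens
        id_frame = self.proj(image_latent).unsqueeze(1)     # (B, 1, N, D)
        return torch.cat([id_frame, video_latent], dim=1)   # (B, T+1, N, D)


class AudioCrossAttention(nn.Module):
    """Spatial cross-attention from video tokens (queries) to audio features
    (keys/values), applied frame by frame with a residual connection."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_latent: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_latent: (B, T, N, D); audio_feats: (B, T, A, D) -- A audio tokens per frame
        B, T, N, D = video_latent.shape
        q = self.norm(video_latent).reshape(B * T, N, D)
        kv = audio_feats.reshape(B * T, -1, D)
        out, _ = self.attn(q, kv, kv)
        return video_latent + out.reshape(B, T, N, D)       # residual injection


if __name__ == "__main__":
    B, T, N, A, D = 1, 8, 64, 16, 128
    video = torch.randn(B, T, N, D)
    image = torch.randn(B, N, D)
    audio = torch.randn(B, T, A, D)
    video = IdentityTemporalConcat(D)(video, image)         # (1, 9, 64, 128)
    audio = torch.cat([audio[:, :1], audio], dim=1)         # pad audio for the extra ID frame
    video = AudioCrossAttention(D)(video, audio)
    print(video.shape)  # torch.Size([1, 9, 64, 128])
```

In the full model, the reference latent would presumably come from the video VAE encoding of the reference image and the audio tokens from a pretrained audio encoder, with such injections repeated hierarchically across transformer blocks; the sketch only shows how the temporal concatenation and the per-frame cross-attention would be wired.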
Related papers
- PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement [26.89021788485701]
PolyVivid is a multi-subject video customization framework that enables flexible and identity-consistent generation.
In experiments, PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
arXiv Detail & Related papers (2025-06-09T15:11:09Z)
- MAGREF: Masked Guidance for Any-Reference Video Generation [33.35245169242822]
MAGREF is a unified framework for any-reference video generation.
We propose a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference.
Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios.
arXiv Detail & Related papers (2025-05-29T17:58:15Z)
- CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance [34.345125922868]
We propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM).
Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort.
Our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation.
arXiv Detail & Related papers (2025-03-13T14:07:58Z)
- Phantom: Subject-consistent video generation via cross-modal alignment [16.777805813950486]
We propose a unified video generation framework for both single- and multi-subject references.
The proposed method achieves high-fidelity subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion.
arXiv Detail & Related papers (2025-02-16T11:02:50Z)
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [117.13475564834458]
We propose a new way of self-attention calculation, termed Consistent Self-Attention.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
arXiv Detail & Related papers (2024-05-02T16:25:16Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z)
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing tasks.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z)
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models [43.46536102838717]
VideoDreamer is a novel framework for customized multi-subject text-to-video generation.
It can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
arXiv Detail & Related papers (2023-11-02T04:38:50Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.