HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
- URL: http://arxiv.org/abs/2508.16930v1
- Date: Sat, 23 Aug 2025 07:30:18 GMT
- Title: HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
- Authors: Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong
- Abstract summary: HunyuanVideo-Foley is an end-to-end text-video-to-audio framework. It synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context, and achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching.
- Score: 14.921126281071544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
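Innovations (2) and (3) of the abstract are concrete enough to sketch. Below is a minimal, illustrative PyTorch reading, not the released code: a REPA-style cosine loss that aligns intermediate latent-diffusion features with frozen self-supervised audio features, and a dual-stream block that fuses audio and video tokens through joint attention while injecting text semantics via cross-attention. All module names, dimensions, and the choice of SSL encoder are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAVBlock(nn.Module):
    """Dual-stream fusion: audio and video tokens attend over their
    concatenated sequence (joint attention); text is injected into the
    audio stream via cross-attention."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_av = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, audio, video, text):
        na = audio.size(1)
        # Joint attention: every token sees both modalities in one pass.
        av = self.norm_av(torch.cat([audio, video], dim=1))
        fused, _ = self.joint_attn(av, av, av)
        audio = audio + fused[:, :na]
        video = video + fused[:, na:]
        # Textual semantic injection: audio queries text keys/values.
        txt = self.norm_txt(text)
        ctx, _ = self.text_xattn(audio, txt, txt)
        return audio + ctx, video

def repa_loss(hidden, ssl_feats, proj):
    """Representation alignment: cosine distance between projected diffusion
    hidden states and frozen self-supervised audio features."""
    h = F.normalize(proj(hidden), dim=-1)
    z = F.normalize(ssl_feats, dim=-1)
    return 1.0 - (h * z).sum(dim=-1).mean()

# Toy shapes: 2 clips, 100 audio-latent tokens, 40 video tokens, 16 text tokens.
block = JointAVBlock()
audio, video = block(torch.randn(2, 100, 512), torch.randn(2, 40, 512),
                     torch.randn(2, 16, 512))
proj = nn.Linear(512, 768)  # project hidden states to the SSL feature dim
loss = repa_loss(audio, torch.randn(2, 100, 768), proj)
```

In a REPA-style setup this auxiliary loss is added to the diffusion objective, which is one way to read the abstract's claim of improved audio quality and generation stability.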
Related papers
- LTX-2: Efficient Joint Audio-Visual Foundation Model [3.1804093402153506]
LTX-2 is an open-source model capable of generating temporally synchronized audiovisual content. We employ a multilingual text encoder for broader prompt understanding. LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene.
arXiv Detail & Related papers (2026-01-06T18:24:41Z)
- ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation [55.76423101183408]
ViSAudio is an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture. It generates high-quality audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.
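Conditional flow matching of the kind named here trains a velocity field on straight-line paths between noise and data; a generic sketch follows, where the per-frame conditioning and `TinyVelocityNet` are stand-ins rather than ViSAudio's dual-branch design.

```python
import torch
import torch.nn as nn

def cfm_loss(model, x1, cond):
    """Regress the velocity field toward the straight path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)                           # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # one t per sample
    xt = (1 - t) * x0 + t * x1                          # point on the path
    return ((model(xt, t, cond) - (x1 - x0)) ** 2).mean()

class TinyVelocityNet(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))
    def forward(self, xt, t, cond):
        t = t.expand(-1, xt.size(1), 1)                 # broadcast t per token
        return self.net(torch.cat([xt, t, cond], dim=-1))

x1 = torch.randn(4, 100, 64)    # target audio latents
cond = torch.randn(4, 100, 64)  # per-frame visual conditioning
loss = cfm_loss(TinyVelocityNet(), x1, cond)
```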
arXiv Detail & Related papers (2025-12-02T18:56:12Z)
- Training-Free Multimodal Guidance for Video to Audio Generation [22.64037676707457]
Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos. Existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities. We propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment.
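The geometric idea is concrete: k unit-norm modality embeddings span a parallelotope whose volume is sqrt(det(G)) for the Gram matrix G = EEᵀ, and a smaller volume means tighter mutual alignment. A hedged sketch of turning its gradient into a training-free guidance signal (the paper's exact guidance schedule may differ):

```python
import torch
import torch.nn.functional as F

def embedding_volume(embs):
    """embs: (k, d) stack of modality embeddings -> sqrt(det(Gram))."""
    e = F.normalize(embs, dim=-1)
    gram = e @ e.T                                  # (k, k) Gram matrix
    return torch.linalg.det(gram).clamp(min=0).sqrt()

# Guidance step: shrink the volume, pulling the (differentiable) audio
# embedding toward the fixed video and text embeddings.
audio = torch.randn(512, requires_grad=True)
video, text = torch.randn(512), torch.randn(512)
vol = embedding_volume(torch.stack([audio, video, text]))
vol.backward()                                      # d(volume) / d(audio)
guided = audio - 0.1 * audio.grad                   # one descent step
```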
arXiv Detail & Related papers (2025-09-29T10:00:36Z)
- StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation [91.45910771331741]
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing.
arXiv Detail & Related papers (2025-08-11T17:58:24Z)
- AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation [24.799628787198397]
AudioGen-Omni generates high-fidelity audio, speech, and song coherently synchronized with the input video. Its joint training paradigm integrates large-scale video-text-audio corpora, and dense frame-level representations are fused using an AdaLN-based joint attention mechanism. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
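One plausible reading of "AdaLN-based joint attention", sketched below: a pooled conditioning vector produces per-layer scale and shift for an affine-free LayerNorm, and joint attention then mixes the concatenated frame-level audio and video tokens. Names and shapes are illustrative assumptions, not AudioGen-Omni's code.

```python
import torch
import torch.nn as nn

class AdaLNJointAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # affine comes from cond
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, cond):
        # cond: (B, dim) pooled conditioning, e.g. timestep + global text emb.
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        out, _ = self.attn(h, h, h)
        return tokens + out

# Dense frame-level fusion: concatenate audio and video tokens, then let joint
# attention mix them under the shared AdaLN modulation.
layer = AdaLNJointAttention()
audio_frames = torch.randn(2, 200, 256)  # e.g. audio tokens for an 8 s clip
video_frames = torch.randn(2, 64, 256)   # video patch/frame tokens
fused = layer(torch.cat([audio_frames, video_frames], dim=1),
              torch.randn(2, 256))
```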
arXiv Detail & Related papers (2025-08-01T16:03:57Z)
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. The approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
- Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [27.20097004987987]
We propose a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. Our experiments show that Kling-Foley, trained with the flow matching objective, achieves new audio-visual SOTA performance.
arXiv Detail & Related papers (2025-06-24T16:39:39Z)
- AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
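A minimal sketch of how optimization-based latent alignment can look, assuming a frozen shared-embedding model in the role of the aligner: after each ordinary denoising update, ascend the gradient of cross-modal similarity with respect to the latent. `step_fn`, `decode`, and `aligner` below are placeholders, not the paper's components.

```python
import torch

def aligned_step(latent, step_fn, decode, aligner, target_emb, lr=0.05):
    """One denoising update, then a gradient nudge that raises the cosine
    similarity between the decoded sample's embedding and the other
    modality's embedding."""
    latent = step_fn(latent)                      # ordinary diffusion update
    latent = latent.detach().requires_grad_(True)
    emb = aligner(decode(latent))                 # embed the current estimate
    sim = torch.cosine_similarity(emb, target_emb, dim=-1).mean()
    (grad,) = torch.autograd.grad(sim, latent)
    return (latent + lr * grad).detach()          # ascend the alignment score

# Toy usage with stand-in components:
step_fn = lambda z: 0.9 * z           # placeholder denoiser update
decode = torch.nn.Linear(32, 128)     # placeholder latent decoder
aligner = torch.nn.Linear(128, 64)    # placeholder shared-embedding model
z = torch.randn(2, 32)
z = aligned_step(z, step_fn, decode, aligner, torch.randn(2, 64))
```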
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
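A sketch of how such adaptation can work, assuming only what the summary states (a frozen text-conditioned video model plus a pre-trained audio encoder): train a light adapter that maps pooled audio features to a few pseudo text tokens that stand in for a prompt. The adapter design below is an assumption.

```python
import torch
import torch.nn as nn

class AudioToTextTokens(nn.Module):
    """Maps a pooled audio-encoder embedding to n pseudo prompt tokens in the
    frozen T2V model's text-token space."""
    def __init__(self, audio_dim: int = 768, text_dim: int = 1024,
                 n_tokens: int = 8):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, text_dim), nn.GELU(),
                                  nn.Linear(text_dim, n_tokens * text_dim))
        self.n_tokens, self.text_dim = n_tokens, text_dim

    def forward(self, audio_emb):
        # audio_emb: (B, audio_dim) -> (B, n_tokens, text_dim)
        return self.proj(audio_emb).view(-1, self.n_tokens, self.text_dim)

adapter = AudioToTextTokens()
pseudo_prompt = adapter(torch.randn(4, 768))  # feed to the frozen T2V model
```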
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
By design, MM-Diffusion consists of a sequential multi-modal U-Net for the joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
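A generic sketch of the joint denoising idea: one network predicts noise for both modalities at every step, so audio and video stay coupled throughout sampling. The simplified update rule and `ToyJointNet` are placeholders, not the paper's coupled U-Net or sampler.

```python
import torch
import torch.nn as nn

class ToyJointNet(nn.Module):
    """Placeholder for a multi-modal U-Net that denoises both streams."""
    def forward(self, audio, video, t):
        return torch.zeros_like(audio), torch.zeros_like(video)

@torch.no_grad()
def joint_sample(model, steps: int = 50):
    audio = torch.randn(1, 1, 1024)   # audio-latent placeholder
    video = torch.randn(1, 4, 8, 8)   # video-latent placeholder
    for i in reversed(range(steps)):
        t = torch.full((1,), i)
        eps_a, eps_v = model(audio, video, t)  # coupled noise prediction
        audio = audio - eps_a / steps          # simplified update rule
        video = video - eps_v / steps
    return audio, video

audio, video = joint_sample(ToyJointNet())
```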
arXiv Detail & Related papers (2022-12-19T14:11:52Z)