HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
- URL: http://arxiv.org/abs/2508.16930v1
- Date: Sat, 23 Aug 2025 07:30:18 GMT
- Title: HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
- Authors: Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong
- Abstract summary: HunyuanVideo-Foley is an end-to-end text-video-to-audio framework. It synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context, and achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching.
- Score: 14.921126281071544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
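Innovations (2) and (3) of the abstract are concrete enough to sketch. Below is a minimal, illustrative PyTorch reading, not the released code: a REPA-style cosine loss that aligns intermediate latent-diffusion features with frozen self-supervised audio features, and a dual-stream block that fuses audio and video tokens through joint attention while injecting text semantics via cross-attention. All module names, dimensions, and the choice of SSL encoder are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAVBlock(nn.Module):
    """Dual-stream fusion: audio and video tokens attend over their
    concatenated sequence (joint attention); text is injected into the
    audio stream via cross-attention."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_av = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, audio, video, text):
        na = audio.size(1)
        # Joint attention: every token sees both modalities in one pass.
        av = self.norm_av(torch.cat([audio, video], dim=1))
        fused, _ = self.joint_attn(av, av, av)
        audio = audio + fused[:, :na]
        video = video + fused[:, na:]
        # Textual semantic injection: audio queries text keys/values.
        txt = self.norm_txt(text)
        ctx, _ = self.text_xattn(audio, txt, txt)
        return audio + ctx, video

def repa_loss(hidden, ssl_feats, proj):
    """Representation alignment: cosine distance between projected diffusion
    hidden states and frozen self-supervised audio features."""
    h = F.normalize(proj(hidden), dim=-1)
    z = F.normalize(ssl_feats, dim=-1)
    return 1.0 - (h * z).sum(dim=-1).mean()

# Toy shapes: 2 clips, 100 audio-latent tokens, 40 video tokens, 16 text tokens.
block = JointAVBlock()
audio, video = block(torch.randn(2, 100, 512), torch.randn(2, 40, 512),
                     torch.randn(2, 16, 512))
proj = nn.Linear(512, 768)  # project hidden states to the SSL feature dim
loss = repa_loss(audio, torch.randn(2, 100, 768), proj)
```

In a REPA-style setup this auxiliary loss is added to the diffusion objective, which is one way to read the abstract's claim of improved audio quality and generation stability.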
Related papers
- LTX-2: Efficient Joint Audio-Visual Foundation Model [3.1804093402153506]
LTX-2 is an open-source model capable of generating temporally synchronized audiovisual content. We employ a multilingual text encoder for broader prompt understanding. LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene.
arXiv Detail & Related papers (2026-01-06T18:24:41Z)
- ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation [55.76423101183408]
ViSAudio is an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture. It generates high-quality audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.
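Conditional flow matching of the kind named here trains a velocity field on straight-line paths between noise and data; a generic sketch follows, where the per-frame conditioning and `TinyVelocityNet` are stand-ins rather than ViSAudio's dual-branch design.

```python
import torch
import torch.nn as nn

def cfm_loss(model, x1, cond):
    """Regress the velocity field toward the straight path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)                           # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # one t per sample
    xt = (1 - t) * x0 + t * x1                          # point on the path
    return ((model(xt, t, cond) - (x1 - x0)) ** 2).mean()

class TinyVelocityNet(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))
    def forward(self, xt, t, cond):
        t = t.expand(-1, xt.size(1), 1)                 # broadcast t per token
        return self.net(torch.cat([xt, t, cond], dim=-1))

x1 = torch.randn(4, 100, 64)    # target audio latents
cond = torch.randn(4, 100, 64)  # per-frame visual conditioning
loss = cfm_loss(TinyVelocityNet(), x1, cond)
```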
arXiv Detail & Related papers (2025-12-02T18:56:12Z)
- Training-Free Multimodal Guidance for Video to Audio Generation [22.64037676707457]
Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos. Existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities. We propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment.
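The geometric idea is concrete: k unit-norm modality embeddings span a parallelotope whose volume is sqrt(det(G)) for the Gram matrix G = EEᵀ, and a smaller volume means tighter mutual alignment. A hedged sketch of turning its gradient into a training-free guidance signal (the paper's exact guidance schedule may differ):

```python
import torch
import torch.nn.functional as F

def embedding_volume(embs):
    """embs: (k, d) stack of modality embeddings -> sqrt(det(Gram))."""
    e = F.normalize(embs, dim=-1)
    gram = e @ e.T                                  # (k, k) Gram matrix
    return torch.linalg.det(gram).clamp(min=0).sqrt()

# Guidance step: shrink the volume, pulling the (differentiable) audio
# embedding toward the fixed video and text embeddings.
audio = torch.randn(512, requires_grad=True)
video, text = torch.randn(512), torch.randn(512)
vol = embedding_volume(torch.stack([audio, video, text]))
vol.backward()                                      # d(volume) / d(audio)
guided = audio - 0.1 * audio.grad                   # one descent step
```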
arXiv Detail & Related papers (2025-09-29T10:00:36Z)
- StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation [91.45910771331741]
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing.
arXiv Detail & Related papers (2025-08-11T17:58:24Z)
- AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation [24.799628787198397]
AudioGen-Omni generates high-fidelity audio, speech, and song coherently synchronized with the input video. Its joint training paradigm integrates large-scale video-text-audio corpora, and dense frame-level representations are fused using an AdaLN-based joint attention mechanism. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
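One plausible reading of "AdaLN-based joint attention", sketched below: a pooled conditioning vector produces per-layer scale and shift for an affine-free LayerNorm, and joint attention then mixes the concatenated frame-level audio and video tokens. Names and shapes are illustrative assumptions, not AudioGen-Omni's code.

```python
import torch
import torch.nn as nn

class AdaLNJointAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # affine comes from cond
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, cond):
        # cond: (B, dim) pooled conditioning, e.g. timestep + global text emb.
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        out, _ = self.attn(h, h, h)
        return tokens + out

# Dense frame-level fusion: concatenate audio and video tokens, then let joint
# attention mix them under the shared AdaLN modulation.
layer = AdaLNJointAttention()
audio_frames = torch.randn(2, 200, 256)  # e.g. audio tokens for an 8 s clip
video_frames = torch.randn(2, 64, 256)   # video patch/frame tokens
fused = layer(torch.cat([audio_frames, video_frames], dim=1),
              torch.randn(2, 256))
```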
arXiv Detail & Related papers (2025-08-01T16:03:57Z)
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. The approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
- Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [27.20097004987987]
We propose a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. Our experiments show that Kling-Foley, trained with the flow matching objective, achieves new audio-visual SOTA performance.
arXiv Detail & Related papers (2025-06-24T16:39:39Z)
- AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
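A minimal sketch of how optimization-based latent alignment can look, assuming a frozen shared-embedding model in the role of the aligner: after each ordinary denoising update, ascend the gradient of cross-modal similarity with respect to the latent. `step_fn`, `decode`, and `aligner` below are placeholders, not the paper's components.

```python
import torch

def aligned_step(latent, step_fn, decode, aligner, target_emb, lr=0.05):
    """One denoising update, then a gradient nudge that raises the cosine
    similarity between the decoded sample's embedding and the other
    modality's embedding."""
    latent = step_fn(latent)                      # ordinary diffusion update
    latent = latent.detach().requires_grad_(True)
    emb = aligner(decode(latent))                 # embed the current estimate
    sim = torch.cosine_similarity(emb, target_emb, dim=-1).mean()
    (grad,) = torch.autograd.grad(sim, latent)
    return (latent + lr * grad).detach()          # ascend the alignment score

# Toy usage with stand-in components:
step_fn = lambda z: 0.9 * z           # placeholder denoiser update
decode = torch.nn.Linear(32, 128)     # placeholder latent decoder
aligner = torch.nn.Linear(128, 64)    # placeholder shared-embedding model
z = torch.randn(2, 32)
z = aligned_step(z, step_fn, decode, aligner, torch.randn(2, 64))
```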
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
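A sketch of how such adaptation can work, assuming only what the summary states (a frozen text-conditioned video model plus a pre-trained audio encoder): train a light adapter that maps pooled audio features to a few pseudo text tokens that stand in for a prompt. The adapter design below is an assumption.

```python
import torch
import torch.nn as nn

class AudioToTextTokens(nn.Module):
    """Maps a pooled audio-encoder embedding to n pseudo prompt tokens in the
    frozen T2V model's text-token space."""
    def __init__(self, audio_dim: int = 768, text_dim: int = 1024,
                 n_tokens: int = 8):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, text_dim), nn.GELU(),
                                  nn.Linear(text_dim, n_tokens * text_dim))
        self.n_tokens, self.text_dim = n_tokens, text_dim

    def forward(self, audio_emb):
        # audio_emb: (B, audio_dim) -> (B, n_tokens, text_dim)
        return self.proj(audio_emb).view(-1, self.n_tokens, self.text_dim)

adapter = AudioToTextTokens()
pseudo_prompt = adapter(torch.randn(4, 768))  # feed to the frozen T2V model
```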
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
By design, MM-Diffusion consists of a sequential multi-modal U-Net for the joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
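A generic sketch of the joint denoising idea: one network predicts noise for both modalities at every step, so audio and video stay coupled throughout sampling. The simplified update rule and `ToyJointNet` are placeholders, not the paper's coupled U-Net or sampler.

```python
import torch
import torch.nn as nn

class ToyJointNet(nn.Module):
    """Placeholder for a multi-modal U-Net that denoises both streams."""
    def forward(self, audio, video, t):
        return torch.zeros_like(audio), torch.zeros_like(video)

@torch.no_grad()
def joint_sample(model, steps: int = 50):
    audio = torch.randn(1, 1, 1024)   # audio-latent placeholder
    video = torch.randn(1, 4, 8, 8)   # video-latent placeholder
    for i in reversed(range(steps)):
        t = torch.full((1,), i)
        eps_a, eps_v = model(audio, video, t)  # coupled noise prediction
        audio = audio - eps_a / steps          # simplified update rule
        video = video - eps_v / steps
    return audio, video

audio, video = joint_sample(ToyJointNet())
```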
arXiv Detail & Related papers (2022-12-19T14:11:52Z)