Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
- URL: http://arxiv.org/abs/2412.18157v1
- Date: Tue, 24 Dec 2024 04:29:46 GMT
- Title: Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
- Authors: Yaoyun Zhang, Xuenan Xu, Mengyue Wu
- Abstract summary: We propose Smooth-Foley, a V2A generative model that takes semantic guidance from textual labels throughout generation to enhance both the semantic and temporal alignment of the audio.
A frame adapter integrates high-resolution frame-wise video features while a temporal adapter integrates temporal conditions obtained from similarities of visual frames and textual labels.
Results show that Smooth-Foley performs better than existing models on both continuous sound scenarios and general scenarios.
- Score: 20.673800900456467
- Abstract: The video-to-audio (V2A) generation task has drawn attention in the multimedia field because of its practicality for producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate which sound events occur and when. Recent studies on synthesizing immersive and synchronized audio still struggle with videos that contain a moving visual presence: the temporal condition is not accurate enough, and the low resolution of the semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model that takes semantic guidance from textual labels throughout generation to enhance both the semantic and temporal alignment of the audio. Two adapters are trained to leverage pre-trained text-to-audio generation models: a frame adapter integrates high-resolution frame-wise video features, while a temporal adapter integrates temporal conditions obtained from similarities between visual frames and textual labels. Incorporating semantic guidance from textual labels yields precise audio-video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth-Foley outperforms existing models in both continuous-sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth-Foley exhibits higher quality and better adherence to physical laws.
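The temporal adapter described above is driven by similarities between visual frames and textual labels. The sketch below shows one plausible way to compute such a frame-level similarity curve with an off-the-shelf CLIP model; the checkpoint name, normalization, and min-max rescaling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the Smooth-Foley code): derive a frame-level temporal
# condition from frame-label similarities, in the spirit of the temporal adapter.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def frame_label_similarity(frames: list[Image.Image], label: str) -> torch.Tensor:
    """Return a [num_frames] curve of cosine similarities between frames and the label."""
    # Checkpoint choice is an assumption for illustration.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    inputs = processor(text=[label], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])

    # Cosine similarity between each frame embedding and the single label embedding.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sim = (image_emb @ text_emb.T).squeeze(-1)  # shape: [num_frames]

    # Rescale to [0, 1] so the curve can act as a soft "sound source is active" signal.
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return sim
```

A curve like this could be resampled to the audio latent's temporal resolution before being passed to an adapter; how Smooth-Foley actually consumes the similarities is described only at the level of the abstract here.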
Related papers
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system that uses frame-level Root Mean Square (RMS) energy as a temporal event condition together with semantic timbre prompts (a minimal RMS sketch is given after this list).
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronized with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content creation in the music and film industry.
We hypothesize that focusing on these aspects of audio generation, namely which sound events are present and how they are ordered in time, could improve performance when training data is limited.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can be used plug-and-play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
We present Sound2Sight, a deep variational framework trained to learn a per-frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
arXiv Detail & Related papers (2020-07-23T16:57:44Z)
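As referenced in the Video-Foley entry above, RMS serves as a frame-level temporal event condition. The sketch below shows how such an envelope could be extracted from a mono waveform; the frame and hop sizes are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the Video-Foley code): frame-level RMS envelope of a waveform,
# usable as a temporal event condition. Assumes a non-empty mono waveform.
import numpy as np


def rms_envelope(waveform: np.ndarray, frame_length: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Return the per-frame root-mean-square energy of a mono waveform."""
    num_frames = 1 + max(0, len(waveform) - frame_length) // hop_length
    rms = np.empty(num_frames, dtype=np.float32)
    for i in range(num_frames):
        frame = waveform[i * hop_length : i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms


# Example: at 44.1 kHz, hop_length=256 gives one RMS value roughly every 5.8 ms.
# envelope = rms_envelope(audio)
```

In a Video-Foley-style pipeline, an envelope like this extracted from reference audio would presumably serve as the temporal target or conditioning signal; the actual system's details go beyond what the entry above states.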