Related papers: YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

URL: http://arxiv.org/abs/2412.09168v1
Date: Thu, 12 Dec 2024 10:55:57 GMT
Title: YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Authors: Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian, Yihua Wang, Chaofan Ding, Lei Xie,
Abstract summary: YingSound is a foundation model designed for video-guided sound generation.<n>It supports high-quality audio generation in few-shot settings.<n>We show that YingSound effectively generates high-quality synchronized sounds through automated evaluations and human studies.
Score: 10.429203168607147
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, an industry-standard video-to-audio (V2A) dataset that encompasses various real-world scenarios is presented. We show that YingSound effectively generates high-quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: \url{https://giantailab.github.io/yingsound/}

Related papers

SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation.<n>We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audios.
arXiv Detail & Related papers (2025-08-01T17:05:04Z)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos.<n>Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions.<n> Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [27.20097004987987]
We propose a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content.<n>Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance.
arXiv Detail & Related papers (2025-06-24T16:39:39Z)
AudioX: Diffusion Transformer for Anything-to-Audio Generation [72.84633243365093]
AudioX is a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. It can generate both general audio and music with high quality, while offering flexible natural language control. To address data scarcity, we curate two datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset.
arXiv Detail & Related papers (2025-03-13T16:30:59Z)
UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation [44.21422404659117]
UniForm is a unified multi-task diffusion transformer that jointly generates audio and visual modalities in a shared latent space. A single diffusion process models both audio and video, capturing the inherent correlations between sound and vision. By leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches.
arXiv Detail & Related papers (2025-02-06T09:18:30Z)
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [56.01110988816489]
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. MMAudio achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
arXiv Detail & Related papers (2024-12-19T18:59:55Z)
Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts.
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS. Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt. By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of such processes in the music and film industry. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames. We propose a decoupled audio-video dependence combining audio and video features from their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously. MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.