Related papers: Learning to Highlight Audio by Watching Movies

Learning to Highlight Audio by Watching Movies

URL: http://arxiv.org/abs/2505.12154v1
Date: Sat, 17 May 2025 22:03:57 GMT
Title: Learning to Highlight Audio by Watching Movies
Authors: Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh,
Abstract summary: We introduce visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video.<n>To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies.<n>Our approach consistently outperforms several baselines in both quantitative and subjective evaluation.
Score: 37.9846964966927
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.

Related papers

Semantic visually-guided acoustic highlighting with large vision-language models [34.707752102338816]
Current audio mixing remains largely manual and labor-intensive.<n>It remains unclear which visual aspects are most effective as conditioning signals.<n>We identify which visual-semantic cues most strongly support coherent and visually aligned audio remixing.
arXiv Detail & Related papers (2026-01-12T01:30:15Z)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos.<n>Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions.<n> Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance. We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z)
Video-to-Audio Generation with Hidden Alignment [27.11625918406991]
We offer insights into the video-to-audio generation paradigm, focusing on vision encoders, auxiliary embeddings, and data augmentation techniques.<n>We demonstrate our model exhibits state-of-the-art video-to-audio generation capabilities.
arXiv Detail & Related papers (2024-07-10T08:40:39Z)
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework. It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a one exploiting the relationship across audio and visual components. Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
We present Sound2Sight, a deep variational framework, that is trained to learn a per frame prior conditioned on a joint embedding of audio and past frames. To improve the quality and coherence of the generated frames, we propose a multimodal discriminator. Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
arXiv Detail & Related papers (2020-07-23T16:57:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.