Video Object Segmentation-Aware Audio Generation
- URL: http://arxiv.org/abs/2509.26604v1
- Date: Tue, 30 Sep 2025 17:49:41 GMT
- Title: Video Object Segmentation-Aware Audio Generation
- Authors: Ilpo Viertola, Vladimir Iashin, Esa Rahtu
- Abstract summary: Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our method demonstrates substantial improvements over current state-of-the-art methods and sets a new standard for controllable, high-fidelity Foley synthesis.
- Score: 13.505371291069892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models attend to the entire video and offer no precise way to prioritize a specific object within a scene, so they often generate unnecessary background sounds or focus on the wrong objects. To address this gap, we introduce the novel task of video object segmentation-aware audio generation, which explicitly conditions sound synthesis on object-level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine-grained, visually localized control over audio generation. To support this task and further research on segmentation-aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state-of-the-art methods and sets a new standard for controllable, high-fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at https://saganet.notion.site
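As a concrete picture of what mask-conditioned generation involves, here is a minimal PyTorch sketch of one plausible ingredient: pooling per-frame visual features inside a segmentation mask to form object-level tokens. All names and shapes are hypothetical; this illustrates the general idea, not SAGANet's implementation.
```python
import torch

def masked_object_tokens(frame_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Average per-frame visual features over the masked object region.

    frame_feats: (T, C, H, W) visual features, one map per frame
    masks:       (T, H, W) binary masks for the target object
    returns:     (T, C), one object-centric token per frame
    """
    m = masks.unsqueeze(1).float()                 # (T, 1, H, W)
    pooled = (frame_feats * m).sum(dim=(2, 3))     # feature mass inside the mask
    area = m.sum(dim=(2, 3)).clamp(min=1e-6)       # mask area, avoids division by zero
    return pooled / area

feats = torch.randn(8, 256, 16, 16)                # 8 frames of backbone features
masks = torch.rand(8, 16, 16) > 0.5                # per-frame object masks
print(masked_object_tokens(feats, masks).shape)    # torch.Size([8, 256])
```
In a full model, tokens like these would join the video and text embeddings as cross-attention conditioning for the audio generator, giving the user object-level control.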
Related papers
- Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval [58.640807985155554]
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to a given query. Most existing VMR methods focus solely on the visual and textual modalities while neglecting the complementary but important audio modality. We propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR.
arXiv Detail & Related papers (2025-08-06T09:58:43Z)
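One way to picture importance-aware fusion is a learned gate that decides, per query, how much the audio stream should contribute. The sketch below is a guess at the flavor of the idea in PyTorch; the module and its inputs are hypothetical, not the IMG model.
```python
import torch
import torch.nn as nn

class ImportanceGate(nn.Module):
    """Scale audio features by a query-dependent importance weight."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, text_q, audio_feats, visual_feats):
        # text_q: (B, D); audio_feats, visual_feats: (B, T, D)
        audio_summary = audio_feats.mean(dim=1)                     # (B, D)
        w = self.score(torch.cat([text_q, audio_summary], dim=-1))  # (B, 1)
        # Audio contributes only as much as the query deems it informative.
        return visual_feats + w.unsqueeze(1) * audio_feats

gate = ImportanceGate(128)
fused = gate(torch.randn(2, 128), torch.randn(2, 10, 128), torch.randn(2, 10, 128))
print(fused.shape)  # torch.Size([2, 10, 128])
```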
- Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation [6.631248829195371]
We introduce Hear-Your-Click, an interactive V2A framework enabling users to generate sounds for specific objects by clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with audio. To measure audio-visual correspondence, we design a new evaluation metric, the CAV score.
arXiv Detail & Related papers (2025-07-07T13:01:50Z)
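The abstract does not define the CAV score, but an audio-visual correspondence metric of this kind is typically a similarity between an audio embedding and an object-level visual embedding. A minimal sketch under that assumption, with hypothetical names:
```python
import torch
import torch.nn.functional as F

def correspondence_score(audio_emb: torch.Tensor, object_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between clip-level audio and mask-pooled visual embeddings.

    audio_emb, object_emb: (B, D), e.g. from pretrained contrastive encoders.
    """
    return F.cosine_similarity(audio_emb, object_emb, dim=-1)

print(correspondence_score(torch.randn(4, 512), torch.randn(4, 512)))  # (4,) scores
```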
- Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. Our framework achieves new state-of-the-art performance on three subsets of the AVSBench dataset.
arXiv Detail & Related papers (2025-06-30T08:40:36Z)
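The vision-derived-query idea can be sketched as object queries seeded from visual features that repeatedly cross-attend to audio features. The loop below illustrates the mechanism with made-up shapes; it is not the paper's architecture.
```python
import torch
import torch.nn as nn

dim, heads = 256, 8
queries = torch.randn(2, 16, dim)        # object queries seeded from vision
audio_feats = torch.randn(2, 50, dim)    # per-timestep audio features

cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
for _ in range(3):                       # iterative refinement rounds
    fetched, _ = cross_attn(queries, audio_feats, audio_feats)
    queries = queries + fetched          # residual update of the queries
print(queries.shape)                     # torch.Size([2, 16, 256])
```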
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
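Read as a pipeline, the three stages compose naturally. The stub below sketches that control flow only; the stage functions are placeholders, not ThinkSound's API.
```python
# Placeholder stages standing in for real generative models.
def base_generation(video):
    return f"foley({video})"                      # stage 1: coherent base track

def refine_object(audio, video, region):
    return f"refine({audio}, {region})"           # stage 2: object-centric refinement

def edit_audio(audio, instruction):
    return f"edit({audio}, '{instruction}')"      # stage 3: language-guided editing

def generate_foley(video, region=None, instruction=None):
    audio = base_generation(video)
    if region is not None:                        # optional user interaction
        audio = refine_object(audio, video, region)
    if instruction is not None:                   # optional text edit
        audio = edit_audio(audio, instruction)
    return audio

print(generate_foley("clip.mp4", region="guitar", instruction="soften the attack"))
```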
- ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation [72.22243595269389]
This paper proposes a multi-agentic framework for audio generation supervised by an autonomous Sound Director agent. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off-screen sound to complement the overall production. Our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.
arXiv Detail & Related papers (2025-03-10T11:57:55Z)
- Synthesizing Audio from Silent Video using Sequence to Sequence Modeling [0.0]
We propose a novel method to generate audio from video using a sequence-to-sequence model.
Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structures.
Our model aims to enhance applications like CCTV footage analysis, silent movie restoration, and video generation models.
arXiv Detail & Related papers (2024-04-25T22:19:42Z)
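The core of any VQ-VAE, including a 3D one, is snapping encoder latents to their nearest codebook entries. A minimal sketch of that quantization step (codebook size and dimensions are arbitrary):
```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z: (N, D) encoder latents; codebook: (K, D) learned codes."""
    dists = torch.cdist(z, codebook)   # (N, K) pairwise L2 distances
    idx = dists.argmin(dim=1)          # nearest code per latent
    return codebook[idx]               # quantized latents, (N, D)

codebook = torch.randn(512, 64)
z = torch.randn(100, 64)               # e.g. flattened 3D video latents
print(quantize(z, codebook).shape)     # torch.Size([100, 64])
```
A sequence-to-sequence decoder would then map the resulting discrete video tokens to audio tokens.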
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the film industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders technique transfer from academia to industry. In this work, we aim to fill the gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
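An optimization-based bridge between modalities can be pictured as nudging one diffusion latent toward higher similarity with the other at each denoising step. The sketch below shows one such gradient step with a cosine objective; the paper's actual aligner may differ.
```python
import torch
import torch.nn.functional as F

def align_step(audio_latent: torch.Tensor, video_latent: torch.Tensor, lr: float = 0.1):
    """One gradient-ascent step on audio-video latent similarity."""
    a = audio_latent.clone().requires_grad_(True)
    score = F.cosine_similarity(a.flatten(), video_latent.flatten(), dim=0)
    score.backward()                       # d(similarity) / d(audio latent)
    with torch.no_grad():
        return a + lr * a.grad             # move the latent toward alignment

audio = torch.randn(4, 16)
video = torch.randn(4, 16)
print(align_step(audio, video).shape)      # torch.Size([4, 16])
```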
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
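Exemplar selection by temporal consistency can be approximated very simply: score each flow-predicted mask by how stable it is across neighboring frames and keep the best. The sketch below uses plain IoU between consecutive masks (flow warping omitted for brevity); names are hypothetical.
```python
import torch

def iou(a: torch.Tensor, b: torch.Tensor) -> float:
    inter = (a & b).sum().item()
    union = max((a | b).sum().item(), 1)
    return inter / union

def select_exemplars(masks, k=2):
    """masks: list of (H, W) bool masks, one per frame; returns frame indices."""
    scores = [iou(masks[t], masks[t + 1]) for t in range(len(masks) - 1)]
    return sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]

masks = [torch.rand(8, 8) > 0.5 for _ in range(5)]
print(select_exemplars(masks))  # e.g. [3, 0]
```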
- Audio-Visual Instance Segmentation [14.10809424760213]
We propose a new multi-modal task, termed audio-visual instance segmentation (AVIS). AVIS aims to simultaneously identify, segment and track individual sounding object instances in audible videos. We introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos.
arXiv Detail & Related papers (2023-10-28T13:37:52Z)
- Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation [22.28510611697998]
We propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the audio-visual segmentation task.
Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features.
arXiv Detail & Related papers (2023-07-25T03:59:04Z)
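Deep fusion of this kind is often realized by letting a transformer self-attend over the concatenation of audio and visual tokens. A minimal PyTorch sketch (dimensions arbitrary, not the AuTR architecture):
```python
import torch
import torch.nn as nn

dim = 128
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
fusion = nn.TransformerEncoder(layer, num_layers=2)

visual = torch.randn(2, 49, dim)    # patch tokens from a video frame
audio = torch.randn(2, 10, dim)     # audio tokens
fused = fusion(torch.cat([visual, audio], dim=1))  # joint self-attention
print(fused.shape)                  # torch.Size([2, 59, 128])
```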
- Object Segmentation with Audio Context [0.5243460995467893]
This project explores multimodal feature aggregation for the video instance segmentation task. We integrate audio features into our video segmentation model to form an audio-visual learning scheme.
arXiv Detail & Related papers (2023-01-04T01:33:42Z)
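A simple form of such integration is broadcasting a clip-level audio embedding over the spatial grid and concatenating it with the visual feature map before the mask head. A sketch with assumed shapes:
```python
import torch

visual = torch.randn(2, 256, 32, 32)            # (B, C, H, W) visual features
audio = torch.randn(2, 64)                      # (B, D) clip-level audio embedding
audio_map = audio[:, :, None, None].expand(-1, -1, 32, 32)
fused = torch.cat([visual, audio_map], dim=1)   # (B, C + D, H, W) for the mask head
print(fused.shape)                              # torch.Size([2, 320, 32, 32])
```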
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)