Related papers: AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control

URL: http://arxiv.org/abs/2511.21146v1
Date: Wed, 26 Nov 2025 07:59:53 GMT
Title: AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
Authors: Xinyue Guo, Xiaoran Yang, Lipan Zhang, Jianxuan Yang, Zhao Wang, Jian Luan
Abstract summary: AV-Edit is a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos. The proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content.
Score: 10.55114688654566
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sound effect editing (modifying audio by adding, removing, or replacing elements) remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based feature gating training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.

Related papers

Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal [90.14887235360611]
SAVEBench is a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content.
arXiv Detail & Related papers (2025-12-14T23:19:15Z)
Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner [66.96392168346851]
AVI-Edit is a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We also design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control.
arXiv Detail & Related papers (2025-12-11T11:58:53Z)
Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits [33.1393328136321]
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes.
arXiv Detail & Related papers (2025-12-08T06:45:11Z)
Object-AVEdit: An Object-level Audio-Visual Editing Model [79.62095842136115]
We present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model. To achieve better structural information preservation and object-level editing effect, we propose an inversion-regeneration holistically-optimized editing algorithm.
arXiv Detail & Related papers (2025-09-27T18:12:13Z)
RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing [21.479883699581308]
We propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks.
arXiv Detail & Related papers (2025-09-17T14:13:40Z)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement, and targeted editing. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising [114.39028517171236]
We introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities.
arXiv Detail & Related papers (2025-03-26T17:59:04Z)
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing. AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process. We present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude. Second, the audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
Audio Editing with Non-Rigid Text Prompts [24.008609489049206]
We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting.
arXiv Detail & Related papers (2023-10-19T16:09:44Z)
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.