Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
- URL: http://arxiv.org/abs/2512.12875v1
- Date: Sun, 14 Dec 2025 23:19:15 GMT
- Title: Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
- Authors: Weihan Xu, Kan Jen Cheng, Koichi Saito, Muhammad Jehanzeb Mirza, Tingle Li, Yisi Liu, Alexander H. Liu, Liming Wang, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji, Gopala Anumanchipalli, Paul Pu Liang
- Abstract summary: SAVEBench is a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model removes target objects from audio and visual content while preserving the remaining content.
- Score: 90.14887235360611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task is challenging because paired audio-visual data captured before and after targeted edits is scarce, and the modalities are heterogeneous. To address these data and modeling challenges, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions that enables object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that SAVE removes target objects from both audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence than pairwise combinations of an audio editor and a video editor.
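The abstract gives no implementation details, so as a minimal sketch of the bridge-style flow-matching objective it describes, the code below trains a toy network to regress the velocity that carries a source latent to its paired target latent along a straight-line interpolant. A true Schrodinger Bridge uses a stochastic (Brownian-bridge) interpolant with a diffusion term; this deterministic simplification, and every name in the snippet, are assumptions for illustration, not the authors' code.

```python
# Toy sketch (PyTorch) of bridge-style flow matching between paired latents.
# Unlike standard diffusion (noise -> data), the model learns a direct
# transport from the source (pre-edit) latent to the target (post-edit) one.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Placeholder velocity field; SAVE's real backbone is unspecified here."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t], dim=-1))  # condition on time t

def bridge_flow_matching_loss(model, x_src, x_tgt):
    """One training step: regress the velocity of the source-to-target path."""
    t = torch.rand(x_src.shape[0], 1)          # random time in [0, 1]
    x_t = (1 - t) * x_src + t * x_tgt          # point on the straight path
    v_target = x_tgt - x_src                   # constant velocity of that path
    return ((model(x_t, t) - v_target) ** 2).mean()

model = VelocityNet(dim=64)
x_src, x_tgt = torch.randn(8, 64), torch.randn(8, 64)  # toy paired latents
loss = bridge_flow_matching_loss(model, x_src, x_tgt)
loss.backward()
```

In SAVE itself the transport runs jointly over audio and video latents with text and mask conditioning so the two modalities stay aligned; the sketch omits all conditioning.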
Related papers
- Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner [66.96392168346851]
AVI-Edit is a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions (a toy illustration of iterative mask refinement appears after this list). We also design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control.
arXiv Detail & Related papers (2025-12-11T11:58:53Z)
- Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits [33.1393328136321]
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between the edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes.
arXiv Detail & Related papers (2025-12-08T06:45:11Z)
- AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control [10.55114688654566]
AV-Edit is a generative sound-effect editing framework that enables fine-grained editing of existing audio tracks in videos. The proposed method employs a specially designed contrastive audio-visual masked autoencoder (CAV-MAE-Edit) for multimodal pre-training. Experiments demonstrate that AV-Edit generates high-quality audio with precise modifications based on visual content.
arXiv Detail & Related papers (2025-11-26T07:59:53Z)
- Object-AVEdit: An Object-level Audio-Visual Editing Model [79.62095842136115]
We present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model. To better preserve structural information and improve the object-level editing effect, we propose a holistically optimized inversion-regeneration editing algorithm.
arXiv Detail & Related papers (2025-09-27T18:12:13Z)
- Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising [114.39028517171236]
We introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate AvED-Bench, a benchmark dataset designed explicitly for zero-shot audio-video editing. AvED achieves superior results on both AvED-Bench and the recent OAVE dataset, validating its generalization capabilities.
arXiv Detail & Related papers (2025-03-26T17:59:04Z)
- Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing.
Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance.
We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z)
- AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
First, an audio Magnitude Modulator module adjusts the temporal dynamics of the scene in response to changes in audio magnitude (a toy loudness-envelope sketch appears after this list).
Second, an audio Frequency Fuser module ensures temporal consistency by aligning the frequency content of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
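AVI-Edit's granularity-aware mask refiner (second entry above) is described only at a high level. Purely as a toy illustration of iteratively refining a coarse user mask toward instance boundaries, and not the paper's method, the sketch below re-labels a thin boundary band at each step using an assumed per-pixel instance-probability map; the band width and threshold are arbitrary choices for this example.

```python
# Toy coarse-to-fine mask refinement; NOT AVI-Edit's actual refiner.
# 'instance_prob' (per-pixel instance probability) is an assumed input.
import numpy as np
from scipy import ndimage

def refine_mask(coarse_mask: np.ndarray, instance_prob: np.ndarray,
                iters: int = 5, band: int = 3) -> np.ndarray:
    """Iteratively snap a coarse binary mask to an instance-probability map.

    Only pixels in a thin band around the current boundary are re-labelled,
    so refinement stays local and the mask interior is preserved.
    """
    mask = coarse_mask.astype(bool)
    for _ in range(iters):
        dilated = ndimage.binary_dilation(mask, iterations=band)
        eroded = ndimage.binary_erosion(mask, iterations=band)
        boundary_band = dilated & ~eroded      # uncertain region near the edge
        snapped = instance_prob > 0.5          # where the object likely is
        mask = (mask & ~boundary_band) | (snapped & boundary_band)
    return mask

coarse = np.zeros((64, 64), dtype=bool)
coarse[20:44, 20:44] = True                    # rough user-provided box
prob = np.zeros((64, 64))
prob[24:40, 18:46] = 0.9                       # toy instance-probability map
refined = refine_mask(coarse, prob)            # hugs the 'true' object extent
```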
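Likewise, AudioScenic's Magnitude Modulator (last entry above) is a learned module; the heuristic sketch below only illustrates the signal it reacts to, computing a per-video-frame RMS loudness envelope that could scale edit strength over time. The hop-length arithmetic and the min-max normalization are assumptions of this sketch, not the paper's design.

```python
# Toy sketch: per-video-frame modulation weights from audio loudness.
import numpy as np

def loudness_envelope(wav: np.ndarray, sr: int, fps: float) -> np.ndarray:
    """RMS loudness per video frame, min-max normalized to [0, 1]."""
    hop = int(sr / fps)                        # audio samples per video frame
    n_frames = len(wav) // hop
    frames = wav[: n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1))  # loudness of each frame's audio
    rms -= rms.min()
    return rms / (rms.max() + 1e-8)            # weights to scale edit strength

sr, fps = 16000, 25
wav = np.random.randn(sr * 2).astype(np.float32)  # 2 s of toy audio
weights = loudness_envelope(wav, sr, fps)         # shape (50,), one per frame
```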