Object-AVEdit: An Object-level Audio-Visual Editing Model
- URL: http://arxiv.org/abs/2510.00050v1
- Date: Sat, 27 Sep 2025 18:12:13 GMT
- Title: Object-AVEdit: An Object-level Audio-Visual Editing Model
- Authors: Youquan Fu, Ruiyang Si, Hongfa Wang, Dongzhan Zhou, Jiacheng Sun, Ping Luo, Di Hu, Hongyuan Zhang, Xuelong Li
- Abstract summary: We present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model. To better preserve structural information and improve the object-level editing effect, we propose an inversion-regeneration holistically-optimized editing algorithm.
- Score: 79.62095842136115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is high demand for audio-visual editing in video post-production and filmmaking. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to add, replace, and remove objects across both the audio and visual modalities while preserving the structural information of the source instances during editing. In this paper, we present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model, bridging the gap in object controllability between audio and current video generation models. Meanwhile, to better preserve structural information and improve the object-level editing effect, we propose an inversion-regeneration holistically-optimized editing algorithm that ensures both information retention during inversion and a better regeneration effect. Extensive experiments demonstrate that our editing model achieves advanced results in both audio and video object-level editing tasks with fine audio-visual semantic alignment. In addition, our audio generation model also achieves advanced performance. More results are available on our project page: https://gewu-lab.github.io/Object_AVEdit-website/.
Related papers
- Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal [90.14887235360611]
SAVEBench is a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning.
SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures.
Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content.
arXiv Detail & Related papers (2025-12-14T23:19:15Z)
- Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner [66.96392168346851]
AVI-Edit is a framework for audio-sync video instance editing.
We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions.
We also design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control.
arXiv Detail & Related papers (2025-12-11T11:58:53Z)
- Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits [33.1393328136321]
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio.
Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes.
arXiv Detail & Related papers (2025-12-08T06:45:11Z)
- AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control [10.55114688654566]
AV-Edit is a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos.
The proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training.
Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content.
arXiv Detail & Related papers (2025-11-26T07:59:53Z)
- Guiding Audio Editing with Audio Language Model [13.126858950459557]
We introduce SmartDJ, a novel framework for stereo audio editing.
Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations.
These operations are then executed by a diffusion model trained to manipulate stereo audio.
arXiv Detail & Related papers (2025-09-25T21:43:45Z)
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning [58.53074381801114]
We introduce EditVerse, a unified framework for image and video generation and editing within a single model.
By representing all modalities, i.e. text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning.
We present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
arXiv Detail & Related papers (2025-09-24T17:59:30Z)
- Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation [6.631248829195371]
We introduce Hear-Your-Click, an interactive V2A framework enabling users to generate sounds for specific objects by clicking on the frame.
To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with audio.
To measure audio-visual correspondence, we designed a new evaluation metric, the CAV score.
arXiv Detail & Related papers (2025-07-07T13:01:50Z)
- Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing.
Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance.
We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z)
- AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
We present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
The audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.