Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
- URL: http://arxiv.org/abs/2512.10571v1
- Date: Thu, 11 Dec 2025 11:58:53 GMT
- Title: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
- Authors: Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, Xinlong Wang,
- Abstract summary: AVI-Edit is a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We also design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control.
- Score: 66.96392168346851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.
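The abstract does not describe how the granularity-aware mask refiner is implemented. As a loose illustration only, the sketch below shows what an iterative coarse-to-fine mask refinement loop might look like, assuming a hypothetical promptable segmenter (`segment` stands in for any off-the-shelf instance segmentation model); the loop structure, names, and stopping criterion are assumptions, not the AVI-Edit method.

```python
import numpy as np

def refine_mask(frame, coarse_mask, segment, num_iters=3, tol=0.995):
    """Illustrative coarse-to-fine mask refinement loop (not the AVI-Edit method).

    frame:       H x W x 3 uint8 image.
    coarse_mask: H x W bool array, rough user-provided region.
    segment:     callable(frame, prompt_mask) -> H x W probability map;
                 stands in for any promptable instance segmenter.
    """
    mask = coarse_mask.astype(bool)
    for _ in range(num_iters):
        prob = segment(frame, mask)        # re-segment, prompted by the current mask
        new_mask = prob > 0.5              # binarize the instance probability map
        # stop when consecutive masks agree almost everywhere (IoU-style check)
        inter = np.logical_and(mask, new_mask).sum()
        union = np.logical_or(mask, new_mask).sum() + 1e-6
        mask = new_mask
        if inter / union >= tol:
            break
    return mask
```

The actual refiner additionally reasons about granularity, i.e. which instance level the user intends, which this toy loop does not capture.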
Related papers
- Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal [90.14887235360611]
SAVEBench is a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content.
arXiv Detail & Related papers (2025-12-14T23:19:15Z)
- Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits [33.1393328136321]
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between the edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes.
arXiv Detail & Related papers (2025-12-08T06:45:11Z)
- AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control [10.55114688654566]
AV-Edit is a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos. The proposed method employs a specially designed contrastive audio-visual masked autoencoder (CAV-MAE-Edit) for multimodal pre-training. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content.
arXiv Detail & Related papers (2025-11-26T07:59:53Z)
- Object-AVEdit: An Object-level Audio-Visual Editing Model [79.62095842136115]
We present Object-AVEdit, achieving object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model. To achieve better structural-information preservation and object-level editing effects, we propose an inversion-regeneration holistically-optimized editing algorithm.
arXiv Detail & Related papers (2025-09-27T18:12:13Z)
- RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing [21.479883699581308]
We propose a novel end-to-end, efficient rectified flow matching-based diffusion framework for audio editing. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks.
arXiv Detail & Related papers (2025-09-17T14:13:40Z)
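For context on the rectified flow matching mentioned in the RFM-Editing entry above: such models train a velocity field along straight paths between noise and data and generate by integrating that field. The minimal sketch below shows a plain Euler sampling loop, assuming a hypothetical `velocity_model` and latent shape; it is a generic illustration of the technique, not the RFM-Editing implementation.

```python
import numpy as np

def rectified_flow_sample(velocity_model, cond, shape, num_steps=25, rng=None):
    """Generic Euler sampler for a rectified-flow model (illustrative only).

    velocity_model: callable(x, t, cond) -> predicted velocity, same shape as x.
    cond:           conditioning signal (e.g., text embedding describing the edit).
    shape:          shape of the audio latent to generate.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)         # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_model(x, t, cond)     # predicted velocity along the straight path
        x = x + dt * v                     # Euler step toward the data side at t = 1
    return x                               # edited audio latent (decoded separately)
```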
- InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves references to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long-sequence dubbing. Comprehensive evaluations on the HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z)
- Audio-Sync Video Generation with Multi-Stream Temporal Control [64.00019697525322]
We introduce MTV, a versatile framework for video generation with precise audio-visual synchronization. MTV separates audio into speech, effects, and music tracks, enabling control over lip motion, event timing, and visual mood. To support the framework, we additionally present DEmix, a dataset of high-quality cinematic videos and demixed audio tracks.
arXiv Detail & Related papers (2025-06-09T17:59:42Z)
- Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising [114.39028517171236]
We introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset, validating its generalization capabilities.
arXiv Detail & Related papers (2025-03-26T17:59:04Z)
- AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
First, an audio Magnitude Modulator module adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
Second, an audio Frequency Fuser module ensures temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
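On the Magnitude Modulator idea in the AudioScenic entry above: the control signal such a module responds to can be as simple as a per-frame loudness envelope. The sketch below computes a frame-aligned RMS envelope from a waveform with NumPy; how AudioScenic actually injects this signal into the video model is not specified in the abstract, so this covers only the signal-extraction side, and the frame rate and window choices are assumptions.

```python
import numpy as np

def framewise_rms_envelope(waveform, sample_rate, video_fps=25):
    """Per-video-frame audio loudness (RMS) envelope, normalized to [0, 1].

    waveform:    1-D float array of audio samples.
    sample_rate: audio sampling rate in Hz.
    video_fps:   frame rate of the video the envelope should align with.
    """
    hop = int(round(sample_rate / video_fps))   # audio samples per video frame
    n_frames = max(1, len(waveform) // hop)
    env = np.empty(n_frames)
    for i in range(n_frames):
        chunk = waveform[i * hop:(i + 1) * hop]
        env[i] = np.sqrt(np.mean(chunk ** 2))   # RMS magnitude of this frame's audio
    peak = env.max()
    return env / peak if peak > 0 else env      # normalize so 1.0 = loudest frame
```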
This list is automatically generated from the titles and abstracts of the papers in this site.