RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing
- URL: http://arxiv.org/abs/2509.14003v1
- Date: Wed, 17 Sep 2025 14:13:40 GMT
- Title: RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing
- Authors: Liting Gao, Yi Yuan, Yaru Chen, Yuelan Cheng, Zhenbo Li, Juan Wen, Shubin Zhang, Wenwu Wang
- Abstract summary: We propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks.
- Score: 21.479883699581308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest, thus demanding precise localization and faithful editing according to the text prompt. Existing training-based and zero-shot methods that rely on full-caption or costly optimization often struggle with complex editing or lack practicality. In this work, we propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing, and construct a dataset featuring overlapping multi-event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.
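The abstract's backbone, rectified flow matching, trains a model to predict a constant velocity along a straight-line path between noise and data. A minimal sketch of how the training targets are formed (illustrative only; the function name, tensor shapes, and latent-audio framing are assumptions, not the authors' implementation):

```python
import numpy as np

def rectified_flow_targets(x0, x1, rng):
    """Build rectified flow matching training targets.

    The model is trained to predict the velocity v = x1 - x0 at the
    interpolated point x_t = (1 - t) * x0 + t * x1, i.e. along the
    straight-line path from noise x0 to data x1.
    """
    # Sample one time t per batch element, broadcastable over the rest.
    t = rng.uniform(size=(x0.shape[0],) + (1,) * (x0.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1   # linear interpolation at time t
    v_target = x1 - x0              # constant velocity along the path
    return x_t, t, v_target

# Toy usage: hypothetical audio latents of shape (batch, channels, frames).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4, 8))   # noise sample
x1 = rng.standard_normal((2, 4, 8))   # data (e.g. an audio latent)
x_t, t, v = rectified_flow_targets(x0, x1, rng)
```

A network would then regress `v` from `(x_t, t)` plus the text condition; the straight-line path is what makes sampling efficient with few integration steps.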
Related papers
- Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner [66.96392168346851]
AVI-Edit is a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We also design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control.
arXiv Detail & Related papers (2025-12-11T11:58:53Z) - AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control [10.55114688654566]
AV-Edit is a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos. The proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content.
arXiv Detail & Related papers (2025-11-26T07:59:53Z) - Audio-Guided Visual Editing with Complex Multi-Modal Prompts [5.694921736486254]
We introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring training. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks. Our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.
arXiv Detail & Related papers (2025-08-28T03:00:30Z) - EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z) - Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z) - FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording. Current TSE techniques often focus on minimizing discrepancies between generated speech and reference within edited regions during training to achieve fluent TSE performance. We propose a new fluency speech editing scheme based on our previous *FluentEditor* model, termed *FluentEditor2*.
arXiv Detail & Related papers (2024-09-28T10:18:35Z) - TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z) - Prompt-guided Precise Audio Editing with Diffusion Models [36.29823730882074]
PPAE serves as a general module for diffusion models and enables precise audio editing.
We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process.
arXiv Detail & Related papers (2024-05-11T07:41:27Z) - Audio Editing with Non-Rigid Text Prompts [24.008609489049206]
We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio.
We explore text prompts that perform addition, style transfer, and in-painting.
arXiv Detail & Related papers (2023-10-19T16:09:44Z) - FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency [44.7425844190807]
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself.
We propose a fluency speech editing model, termed *FluentEditor*, by considering a fluency-aware training criterion in TSE training. The subjective and objective experimental results on VCTK demonstrate that our *FluentEditor* outperforms all advanced baselines in terms of naturalness and fluency.
arXiv Detail & Related papers (2023-09-21T01:58:01Z) - Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award in Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.