SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
- URL: http://arxiv.org/abs/2510.22795v1
- Date: Sun, 26 Oct 2025 18:57:16 GMT
- Title: SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
- Authors: Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer
- Abstract summary: We introduce SAO-Instruct, a model capable of editing audio clips using any free-form natural language instruction. Our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study.
- Score: 34.39865893999257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
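The abstract describes training on (input audio, edit instruction, output audio) triplets. A minimal sketch of such a triplet record, assuming an illustrative schema (the `AudioTriplet` type and its field names are not from the paper):

```python
from dataclasses import dataclass

# Hypothetical representation of one editing triplet from the
# SAO-Instruct training data, as described in the abstract.
# Field names and the AudioTriplet type are illustrative assumptions.
@dataclass
class AudioTriplet:
    input_audio: list[float]   # source waveform samples
    instruction: str           # free-form natural language edit instruction
    output_audio: list[float]  # edited waveform samples

triplet = AudioTriplet(
    input_audio=[0.0, 0.1, -0.1],
    instruction="add light rain in the background",
    output_audio=[0.0, 0.12, -0.08],
)
print(triplet.instruction)
```

In practice the audio fields would hold full waveforms or latent encodings, and triplets like this would be produced by the Prompt-to-Prompt, DDPM inversion, and manual editing pipelines the abstract mentions.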
Related papers
- Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal [90.14887235360611]
SAVEBench is a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content.
arXiv Detail & Related papers (2025-12-14T23:19:15Z) - Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits [33.1393328136321]
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes.
arXiv Detail & Related papers (2025-12-08T06:45:11Z) - Object-AVEdit: An Object-level Audio-Visual Editing Model [79.62095842136115]
We present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model. To achieve better structural information preservation and object-level editing, we propose a holistically optimized inversion-regeneration editing algorithm.
arXiv Detail & Related papers (2025-09-27T18:12:13Z) - Guiding Audio Editing with Audio Language Model [13.126858950459557]
We introduce SmartDJ, a novel framework for stereo audio editing. Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations. These operations are then executed by a diffusion model trained to manipulate stereo audio.
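The SmartDJ abstract describes decomposing a high-level instruction into atomic edit operations. A toy sketch of that idea, assuming an illustrative rule-based parser and a made-up operation vocabulary (the paper uses an audio language model, not rules like these):

```python
# Toy illustration of instruction decomposition into atomic edit
# operations, as SmartDJ is described to do. The parsing rules and the
# {"op", "target"} schema are assumptions for illustration only.

def decompose(instruction: str) -> list[dict]:
    """Split a compound instruction into (op, target) steps."""
    ops = []
    if "remove " in instruction:
        target = instruction.split("remove ")[1].split(" and")[0]
        ops.append({"op": "remove", "target": target})
    if "add " in instruction:
        ops.append({"op": "add", "target": instruction.split("add ")[1]})
    return ops

steps = decompose("remove the traffic noise and add soft piano")
print(steps)
# → [{'op': 'remove', 'target': 'the traffic noise'},
#    {'op': 'add', 'target': 'soft piano'}]
```

Each atomic step would then be handed to a downstream editing model, which is the role the abstract assigns to the diffusion model.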
arXiv Detail & Related papers (2025-09-25T21:43:45Z) - EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms.
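The core idea behind Prompt-to-Prompt-style cross-attention control is to reuse the attention maps from the source prompt for unedited tokens, so that only the regions touched by the edit change. A minimal sketch under that assumption (the function name, list-based "maps", and token indexing are all illustrative, not EditGen's actual implementation):

```python
# Illustrative sketch of cross-attention map injection: attention maps
# computed for the source prompt are kept for unchanged tokens, while
# new attention is used only for tokens affected by the edit.

def inject_attention(source_maps, target_maps, edited_tokens):
    """Merge per-token attention maps, preserving source structure
    everywhere except at the edited token positions."""
    merged = []
    for i, (src, tgt) in enumerate(zip(source_maps, target_maps)):
        merged.append(tgt if i in edited_tokens else src)
    return merged

source = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # per-token attention maps
target = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
merged = inject_attention(source, target, edited_tokens={2})
print(merged)  # only token 2 takes the new map; tokens 0-1 keep source
```

In a real diffusion or auto-regressive model these maps would be the cross-attention weights at each layer and timestep, but the merge logic is the same.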
arXiv Detail & Related papers (2025-07-15T08:44:11Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent audio generation, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Prompt-guided Precise Audio Editing with Diffusion Models [36.29823730882074]
PPAE serves as a general module for diffusion models and enables precise audio editing.
We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process.
arXiv Detail & Related papers (2024-05-11T07:41:27Z) - Audio Editing with Non-Rigid Text Prompts [24.008609489049206]
We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio.
We explore text prompts that perform addition, style transfer, and in-painting.
arXiv Detail & Related papers (2023-10-19T16:09:44Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models [40.13710449689338]
AUDIT is an instruction-guided audio editing model based on latent diffusion models.
It achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks.
arXiv Detail & Related papers (2023-04-03T09:15:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.