Audio Editing with Non-Rigid Text Prompts
- URL: http://arxiv.org/abs/2310.12858v3
- Date: Tue, 24 Sep 2024 11:25:49 GMT
- Title: Audio Editing with Non-Rigid Text Prompts
- Authors: Francesco Paissan, Luca Della Libera, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan
- Abstract summary: We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio.
We explore text prompts that perform addition, style transfer, and in-painting.
- Score: 24.008609489049206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.
Related papers
- Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal [90.14887235360611]
SAVEBench is a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning.
SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures.
Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content.
arXiv Detail & Related papers (2025-12-14T23:19:15Z) - Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits [33.1393328136321]
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio.
Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes.
arXiv Detail & Related papers (2025-12-08T06:45:11Z) - AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control [10.55114688654566]
AV-Edit is a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos.
The proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training.
Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content.
arXiv Detail & Related papers (2025-11-26T07:59:53Z) - SAO-Instruct: Free-form Audio Editing using Natural Language Instructions [34.39865893999257]
We introduce SAO-Instruct, a model capable of editing audio clips using any free-form natural language instruction.
Our model generalizes well to real in-the-wild audio clips and unseen edit instructions.
We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study.
arXiv Detail & Related papers (2025-10-26T18:57:16Z) - Object-AVEdit: An Object-level Audio-Visual Editing Model [79.62095842136115]
We present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm.
To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model.
To achieve better structural information preservation and object-level editing effects, we propose an inversion-regeneration holistically-optimized editing algorithm.
arXiv Detail & Related papers (2025-09-27T18:12:13Z) - RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing [21.479883699581308]
We propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing.
Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks.
arXiv Detail & Related papers (2025-09-17T14:13:40Z) - Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing.
Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance.
We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z) - Speech Editing -- a Summary [8.713498822221222]
This paper explores text-based speech editing methods that modify audio via text transcripts without manual waveform editing.
The aim is to highlight ongoing issues and inspire further research and innovation in speech editing.
arXiv Detail & Related papers (2024-07-24T11:22:57Z) - Improving Text-To-Audio Models with Synthetic Captions [51.19111942748637]
We propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale.
We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions.
arXiv Detail & Related papers (2024-06-18T00:02:15Z) - Prompt-guided Precise Audio Editing with Diffusion Models [36.29823730882074]
PPAE serves as a general module for diffusion models and enables precise audio editing.
We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process.
arXiv Detail & Related papers (2024-05-11T07:41:27Z) - AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
We present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
Second, the audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Looking and Listening: Audio Guided Text Recognition [62.98768236858089]
Text recognition in the wild is a long-standing problem in computer vision.
Recent studies suggest vision and language processing are effective for scene text recognition.
Yet, solving edit errors such as add, delete, or replace is still the main challenge for existing approaches.
We propose the AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction.
arXiv Detail & Related papers (2023-06-06T08:08:18Z) - AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models [40.13710449689338]
AUDIT is an instruction-guided audio editing model based on latent diffusion models.
It achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks.
arXiv Detail & Related papers (2023-04-03T09:15:51Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Context-Aware Prosody Correction for Text-Based Speech Editing [28.459695630420832]
A major drawback of current systems is that edited recordings often sound unnatural because of prosody mismatches around edited regions.
We propose a new context-aware method for more natural sounding text-based editing of speech.
arXiv Detail & Related papers (2021-02-16T18:16:30Z) - Audio Captioning using Gated Recurrent Units [1.3960152426268766]
VGGish audio embedding model is used to explore the usability of audio embeddings in the audio captioning task.
The proposed architecture encodes audio and text input modalities separately and combines them before the decoding stage.
Our experimental results show that the proposed BiGRU-based deep model outperforms the state-of-the-art results.
arXiv Detail & Related papers (2020-06-05T12:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.