AUDIT: Audio Editing by Following Instructions with Latent Diffusion
Models
- URL: http://arxiv.org/abs/2304.00830v2
- Date: Wed, 5 Apr 2023 12:13:48 GMT
- Title: AUDIT: Audio Editing by Following Instructions with Latent Diffusion
Models
- Authors: Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian,
Sheng Zhao
- Abstract summary: AUDIT is an instruction-guided audio editing model based on latent diffusion models.
It achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks.
- Score: 40.13710449689338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio editing is applicable for various purposes, such as adding background
sound effects, replacing a musical instrument, and repairing damaged audio.
Recently, some diffusion-based methods achieved zero-shot audio editing by
using a diffusion and denoising process conditioned on the text description of
the output audio. However, these methods still have some problems: 1) they have
not been trained on editing tasks and cannot ensure good editing effects; 2)
they can erroneously modify audio segments that do not require editing; 3) they
need a complete description of the output audio, which is not always available
or necessary in practical scenarios. In this work, we propose AUDIT, an
instruction-guided audio editing model based on latent diffusion models.
Specifically, AUDIT has three main design features: 1) we construct triplet
training data (instruction, input audio, output audio) for different audio
editing tasks and train a diffusion model using the instruction and the input (to be
edited) audio as conditions to generate the output (edited) audio; 2) it can
automatically learn to only modify segments that need to be edited by comparing
the difference between the input and output audio; 3) it only needs edit
instructions instead of full target audio descriptions as text input. AUDIT
achieves state-of-the-art results in both objective and subjective metrics for
several audio editing tasks (e.g., adding, dropping, replacement, inpainting,
super-resolution). Demo samples are available at https://audit-demo.github.io/.
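The abstract describes conditioning a latent diffusion model on both the edit instruction and the input audio while supervising it with the edited output. The sketch below illustrates one plausible training step under those assumptions; `text_encoder`, `vae`, and `denoiser` (and their call signatures) are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def audit_style_training_step(batch, text_encoder, vae, denoiser, alphas_cumprod):
    """One training step on an (instruction, input audio, output audio) triplet.

    text_encoder / vae / denoiser are hypothetical stand-ins for a text encoder,
    a mel-spectrogram VAE, and a conditional U-Net; alphas_cumprod is the usual
    DDPM cumulative noise schedule, a tensor of shape [T] on the same device.
    """
    instr_emb = text_encoder(batch["instruction"])   # [B, L, D] instruction tokens
    z_in = vae.encode(batch["input_mel"])            # latent of the audio to edit
    z_out = vae.encode(batch["output_mel"])          # latent of the edited target

    # Standard DDPM forward process applied to the *output* latent
    # (latents assumed to be 4-D: [B, C, H, W]).
    B = z_out.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=z_out.device)
    noise = torch.randn_like(z_out)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    z_noisy = a_bar.sqrt() * z_out + (1.0 - a_bar).sqrt() * noise

    # Condition on the instruction (cross-attention context) and on the
    # input-audio latent (channel-wise concatenation), so the model can compare
    # input and output and learn to modify only the segments that need editing.
    eps_pred = denoiser(torch.cat([z_noisy, z_in], dim=1), t, context=instr_emb)
    return F.mse_loss(eps_pred, noise)
```

In this reading, feeding the input-audio latent alongside the noisy output latent is what lets the model learn, from the triplet data alone, to leave unedited segments untouched (design feature 2), while the instruction embedding replaces a full description of the target audio (design feature 3).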
Related papers
- Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation [56.92841782969847]
We introduce a novel task called language-guided joint audio-visual editing.
Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance.
We propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas.
arXiv Detail & Related papers (2024-10-09T22:02:30Z)
- Prompt-guided Precise Audio Editing with Diffusion Models [36.29823730882074]
PPAE serves as a general module for diffusion models and enables precise audio editing.
We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process.
arXiv Detail & Related papers (2024-05-11T07:41:27Z)
- AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
First, we present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
Second, an audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
- Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion [23.89916376623198]
We explore two zero-shot editing techniques for audio signals, which use DDPM inversion with pre-trained diffusion models.
The first, which we coin ZEro-shot Text-based Audio (ZETA) editing, is adopted from the image domain.
The second, named ZEro-shot UnSupervized (ZEUS) editing, is a novel approach for discovering semantically meaningful editing directions without supervision.
arXiv Detail & Related papers (2024-02-15T15:17:26Z)
- SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis [9.118448725265669]
One of the most time-consuming steps when designing sound is synchronizing audio with video.
In video games and animations, no reference audio exists, requiring manual annotation of event timings from the video.
We propose a system to extract repetitive action onsets from a video, which are then used to condition a diffusion model trained to generate a new synchronized sound effects audio track.
arXiv Detail & Related papers (2023-10-23T18:01:36Z)
- Audio Editing with Non-Rigid Text Prompts [24.008609489049206]
We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio.
We explore text prompts that perform addition, style transfer, and in-painting.
arXiv Detail & Related papers (2023-10-19T16:09:44Z)
- Editing 3D Scenes via Text Prompts without Retraining [80.57814031701744]
DN2N is a text-driven editing method that allows for the direct acquisition of a NeRF model with universal editing capabilities.
Our method employs off-the-shelf text-based editing models of 2D images to modify the 3D scene images.
Our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer.
arXiv Detail & Related papers (2023-09-10T02:31:50Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.