Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
- URL: http://arxiv.org/abs/2402.10009v4
- Date: Wed, 29 May 2024 11:27:24 GMT
- Title: Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
- Authors: Hila Manor, Tomer Michaeli
- Abstract summary: We explore two zero-shot editing techniques for audio signals, which use DDPM inversion with pre-trained diffusion models.
The first, which we coin ZEro-shot Text-based Audio (ZETA) editing, is adopted from the image domain.
The second, named ZEro-shot UnSupervized (ZEUS) editing, is a novel approach for discovering semantically meaningful editing directions without supervision.
- Score: 23.89916376623198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Editing signals using large pre-trained models, in a zero-shot manner, has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion with pre-trained diffusion models. The first, which we coin ZEro-shot Text-based Audio (ZETA) editing, is adopted from the image domain. The second, named ZEro-shot UnSupervized (ZEUS) editing, is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody. Samples and code can be found at https://hilamanor.github.io/AudioEditing/ .
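The mechanism both techniques build on is edit-friendly DDPM inversion: noise the signal to each timestep independently, then extract per-step noise maps that make the reverse sampling chain reproduce the input exactly; editing then reuses those maps while changing the conditioning. The sketch below is a minimal, self-contained numpy illustration of that round-trip, using a placeholder noise predictor in place of the pre-trained audio diffusion model the paper uses; all function names and schedule values here are illustrative, not the authors' implementation.

```python
import numpy as np

def make_schedule(T=20):
    # Toy linear beta schedule; real models use hundreds of steps.
    betas = np.linspace(1e-4, 0.2, T)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    return betas, alphas, abar

def mu_pred(x_t, eps_hat, t, betas, alphas, abar):
    # DDPM posterior mean under the noise (epsilon) parameterization.
    return (x_t - betas[t] / np.sqrt(1.0 - abar[t]) * eps_hat) / np.sqrt(alphas[t])

def invert(x0, eps_model, betas, alphas, abar, rng):
    """Edit-friendly inversion: noise each timestep INDEPENDENTLY,
    then solve for the z_t that the sampling step would need."""
    T = len(betas)
    xs = [x0]
    for t in range(T):
        eps = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps)
    zs = []
    for t in range(T - 1, -1, -1):  # extract z_t so that x_{t+1} -> x_t
        sigma = np.sqrt(betas[t])
        mu = mu_pred(xs[t + 1], eps_model(xs[t + 1], t), t, betas, alphas, abar)
        zs.append((xs[t] - mu) / sigma)
    return xs[-1], zs  # x_T and noise maps, ordered from t = T-1 down to 0

def sample(x_T, zs, eps_model, betas, alphas, abar):
    """Reverse sampling with fixed noise maps; with the inverted zs this
    reconstructs x0 exactly, so edits stay close to the source signal."""
    x = x_T
    for i, t in enumerate(range(len(betas) - 1, -1, -1)):
        sigma = np.sqrt(betas[t])
        x = mu_pred(x, eps_model(x, t), t, betas, alphas, abar) + sigma * zs[i]
    return x
```

With any deterministic noise predictor, the reconstruction is exact up to floating-point error; text-based editing (ZETA) changes the text conditioning of the predictor during `sample`, while the unsupervised directions (ZEUS) perturb the denoising trajectory instead.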
Related papers
- TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks - the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z)
- Prompt-guided Precise Audio Editing with Diffusion Models [36.29823730882074]
PPAE serves as a general module for diffusion models and enables precise audio editing.
We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process.
arXiv Detail & Related papers (2024-05-11T07:41:27Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of such processes in the music and film industry.
We hypothesize that focusing on these aspects of audio generation can improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- ZONE: Zero-Shot Instruction-Guided Local Editing [56.56213730578504]
We propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE.
We first convert the editing intent from the user-provided instruction into specific image editing regions through InstructPix2Pix.
We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model.
arXiv Detail & Related papers (2023-12-28T02:54:34Z)
- MotionEditor: Editing Video Motion via Content-Aware Diffusion [96.825431998349]
MotionEditor is a diffusion model for video motion editing.
It incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence.
arXiv Detail & Related papers (2023-11-30T18:59:33Z)
- InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models [42.2977676825086]
In this paper, we develop InstructME, an Instruction guided Music Editing and remixing framework based on latent diffusion models.
Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing.
Our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony.
arXiv Detail & Related papers (2023-08-28T07:11:42Z)
- Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion [70.99781219121803]
Audio Deepfake Detection (ADD) aims to detect the fake audio generated by text-to-speech (TTS), voice conversion (VC) and replay, etc.
We propose a novel ADD model, termed M2S-ADD, that attempts to discover audio authenticity cues during the mono-to-stereo conversion process.
arXiv Detail & Related papers (2023-05-25T02:54:29Z)
- AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models [40.13710449689338]
AUDIT is an instruction-guided audio editing model based on latent diffusion models.
It achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks.
arXiv Detail & Related papers (2023-04-03T09:15:51Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific mask.
Our method is the first to demonstrate zero-shot, text-driven video style and local-attribute editing from a trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.