Recomposer: Event-roll-guided generative audio editing
- URL: http://arxiv.org/abs/2509.05256v1
- Date: Fri, 05 Sep 2025 17:14:29 GMT
- Title: Recomposer: Event-roll-guided generative audio editing
- Authors: Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal
- Abstract summary: We present a system for editing individual sound events within complex scenes, capable of deleting, inserting, and enhancing them. The model is an encoder-decoder transformer operating on SoundStream representations, trained on synthetic (input, desired output) audio example pairs.
- Score: 20.394283728168805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes that can delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., ``enhance Door'') and a graphical representation of the event timing derived from an ``event roll'' transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions -- action, class, timing. Our work demonstrates that ``recomposition'' is an important and practical application.
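The paper does not include code, but the data recipe is concrete enough to sketch. Below is a minimal, hypothetical Python illustration of how a (input, desired output) pair for a ``delete'' edit and the accompanying event-roll conditioning might be constructed; the sample rate, hop size, and helper names are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of building synthetic
# (input, desired-output) pairs and an event-roll conditioning matrix
# for a "delete" edit, assuming 16 kHz mono numpy arrays.
import numpy as np

SR = 16000          # sample rate (assumption)
FRAME_HOP = 160     # 10 ms event-roll resolution (assumption)

def make_delete_example(background, event, onset_s, gain_db=0.0):
    """Mix an isolated event into a real background.

    For a 'delete' edit the mixture is the model INPUT and the clean
    background is the desired OUTPUT; for 'insert' the roles swap.
    """
    mix = background.copy()
    start = int(onset_s * SR)
    end = min(start + len(event), len(mix))
    gain = 10.0 ** (gain_db / 20.0)
    mix[start:end] += gain * event[: end - start]
    return mix, background  # (input, target)

def event_roll(num_frames, events, class_names):
    """Binary [classes x frames] roll marking each event's active span."""
    roll = np.zeros((len(class_names), num_frames), dtype=np.float32)
    for cls, onset_s, offset_s in events:
        c = class_names.index(cls)
        f0 = int(onset_s * SR / FRAME_HOP)
        f1 = int(offset_s * SR / FRAME_HOP)
        roll[c, f0:f1] = 1.0
    return roll

# Usage: pair a 10 s background with a 1.5 s "Door" event at t = 3 s.
bg = np.random.randn(10 * SR).astype(np.float32) * 0.01   # stand-in audio
door = np.random.randn(int(1.5 * SR)).astype(np.float32) * 0.1
x, y = make_delete_example(bg, door, onset_s=3.0)
roll = event_roll(len(x) // FRAME_HOP, [("Door", 3.0, 4.5)], ["Door", "Dog"])
```

The textual edit description (action, class, timing) would then be paired with this roll as conditioning for the encoder-decoder transformer.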
Related papers
- Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions [84.73122243726775]
Bagpiper is an 8B audio foundation model that interprets physical audio via rich captions. During fine-tuning, Bagpiper adopts a caption-then-process workflow to solve diverse tasks without task-specific priors. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio.
arXiv Detail & Related papers (2026-02-05T02:20:07Z)
- Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal [90.14887235360611]
SAVEBench is a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content.
arXiv Detail & Related papers (2025-12-14T23:19:15Z)
- AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control [10.55114688654566]
AV-Edit is a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos. The proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content.
arXiv Detail & Related papers (2025-11-26T07:59:53Z)
- RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing [21.479883699581308]
We propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks.
arXiv Detail & Related papers (2025-09-17T14:13:40Z)
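For context, rectified flow matching trains a model to predict the constant velocity of a straight-line path between noise and data. The sketch below shows that generic training objective, not RFM-Editing's specific architecture or conditioning; `model` and the latent shapes are assumptions.

```python
# Generic rectified-flow-matching training step (the objective family
# RFM-Editing builds on; not the paper's exact model).
import torch

def rfm_loss(model, x1, cond):
    """x1: clean audio latents [B, T, D]; cond: edit conditioning."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                  # noise endpoint
    t = torch.rand(b, 1, 1, device=x1.device)  # uniform times in [0, 1)
    xt = (1.0 - t) * x0 + t * x1               # straight-line path
    v_target = x1 - x0                         # constant target velocity
    v_pred = model(xt, t.reshape(b), cond)     # hypothetical model call
    return torch.mean((v_pred - v_target) ** 2)
```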
- Audio-Guided Visual Editing with Complex Multi-Modal Prompts [5.694921736486254]
We introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring training. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks. Our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.
arXiv Detail & Related papers (2025-08-28T03:00:30Z)
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent foley generation, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
- FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment [11.796771978828403]
We introduce FolAI, a two-stage generative framework that produces temporally coherent and semantically controllable sound effects from video. Results show that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high-quality Foley sound synthesis in professional and interactive settings.
arXiv Detail & Related papers (2024-12-19T16:37:19Z)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Prompt-guided Precise Audio Editing with Diffusion Models [36.29823730882074]
PPAE serves as a general module for diffusion models and enables precise audio editing.
We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process.
arXiv Detail & Related papers (2024-05-11T07:41:27Z)
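The cross-attention idea behind prompt-guided local editing can be sketched compactly: attention between an edited prompt token and the audio frames localizes where the edit should apply. The shapes and thresholding below are hypothetical stand-ins, not PPAE's published implementation.

```python
# Sketch of cross-attention-guided local editing (hypothetical shapes;
# not PPAE's published code).
import torch

def local_edit_mask(attn_maps, token_idx, threshold=0.3):
    """attn_maps: list of [heads, frames, tokens] cross-attention maps.
    Returns a [frames] mask of where the edited word attends strongly."""
    stacked = torch.stack([a.mean(0) for a in attn_maps])  # [L, frames, tokens]
    scores = stacked[..., token_idx].mean(0)               # [frames]
    scores = scores / scores.max().clamp_min(1e-8)
    return (scores > threshold).float()

def blend(orig_latents, edited_latents, mask):
    """Keep original content outside the mask, edited content inside."""
    m = mask[:, None]  # broadcast over the feature dim: [frames, 1]
    return m * edited_latents + (1.0 - m) * orig_latents
```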
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of such processes in the music and film industry.
We hypothesize that focusing on these aspects of audio generation could improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
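A preference dataset like Tango 2's plugs into a DPO-style loss. The sketch below follows the generic Diffusion-DPO recipe, where the implicit reward is how much the trained model's denoising error improves over a frozen reference; the exact Tango 2 objective and hyper-parameters may differ.

```python
# DPO-style preference loss on (winner, loser) audio pairs, following
# the generic Diffusion-DPO recipe (not Tango 2's exact code).
import torch
import torch.nn.functional as F

def dpo_loss(err_w, err_l, err_w_ref, err_l_ref, beta=2000.0):
    """err_*: per-example diffusion denoising errors (lower = better)
    under the trained model and a frozen reference model."""
    # Implicit reward: improvement over the reference, winner vs. loser.
    logits = (err_w_ref - err_w) - (err_l_ref - err_l)
    return -F.logsigmoid(beta * logits).mean()
```

In practice each prompt's winner is paired against each of its losers to form the (winner, loser) batches.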
- SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis [9.118448725265669]
One of the most time-consuming steps when designing sound is synchronizing audio with video.
In video games and animations, no reference audio exists, requiring manual annotation of event timings from the video.
We propose a system to extract repetitive action onsets from a video, which are then used to condition a diffusion model trained to generate a new synchronized sound effects audio track.
arXiv Detail & Related papers (2023-10-23T18:01:36Z)
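The onset-conditioning stage is easy to illustrate. SyncFusion extracts onsets from video; the sketch below uses audio-based librosa onset detection as a stand-in and rasterizes the onsets into a frame-level conditioning track, with the hop size as an assumption.

```python
# Sketch of onset extraction rasterized into a conditioning track
# (librosa audio onsets as a stand-in for SyncFusion's video onsets).
import numpy as np
import librosa

def onset_track(y, sr, hop=256, num_frames=None):
    """Return a binary [num_frames] track with 1.0 at detected onsets."""
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    if num_frames is None:
        num_frames = 1 + len(y) // hop
    track = np.zeros(num_frames, dtype=np.float32)
    idx = np.minimum((onsets * sr / hop).astype(int), num_frames - 1)
    track[idx] = 1.0
    return track
```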
- WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions.
arXiv Detail & Related papers (2023-07-26T17:54:04Z)
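The compositional idea is that an LLM emits a structured "audio script" that downstream models render. The schema below is a hypothetical illustration; the field names are assumptions, not WavJourney's actual format.

```python
# Hypothetical "audio script" schema for LLM-composed audio creation
# (field names are assumptions, not WavJourney's actual format).
from dataclasses import dataclass

@dataclass
class AudioClip:
    kind: str       # "speech" | "music" | "sfx"
    text: str       # prompt or transcript for the generator
    start_s: float  # placement on the timeline
    dur_s: float

script = [
    AudioClip("sfx", "rain on a tin roof", start_s=0.0, dur_s=20.0),
    AudioClip("speech", "Welcome to the night forecast.", 2.0, 5.0),
    AudioClip("music", "soft ambient pad, minor key", 8.0, 12.0),
]
# Each clip is dispatched to a matching model (TTS, music, SFX) and
# mixed at its start time to form the final storytelling soundtrack.
```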
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes, as well as 39.2k non-categorised segments. We train and evaluate state-of-the-art audio recognition and detection models on our dataset, for both audio-only and audio-visual methods.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds generated outside of the camera's view cannot be inferred from video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)