FM Tone Transfer with Envelope Learning
- URL: http://arxiv.org/abs/2310.04811v1
- Date: Sat, 7 Oct 2023 14:03:25 GMT
- Title: FM Tone Transfer with Envelope Learning
- Authors: Franco Caspe, Andrew McPherson and Mark Sandler
- Abstract summary: Tone Transfer is a novel technique for interfacing a sound source with a synthesizer, transforming the timbre of audio excerpts while keeping their musical form and content.
It presents several shortcomings, however, related to poor sound diversity and limited transient and dynamic rendering, which we believe hinder articulation and phrasing in a real-time performance context.
- Score: 8.771755521263811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tone Transfer is a novel deep-learning technique for interfacing a sound source with a synthesizer, transforming the timbre of audio excerpts while keeping their musical form and content. Thanks to its good audio quality and continuous controllability, it has recently been adopted in several audio processing tools. Nevertheless, it still presents several shortcomings related to poor sound diversity and limited transient and dynamic rendering, which we believe hinder articulation and phrasing in a real-time performance context.
In this work, we discuss current Tone Transfer architectures for the task of controlling synthetic audio with musical instruments and examine the challenges they pose for expressive performance. Next, we introduce Envelope Learning, a novel method for designing Tone Transfer architectures that maps musical events using a training objective defined at the synthesis parameter level. Our technique renders note beginnings and endings accurately and for a variety of sounds; these are essential steps for improving musical articulation, phrasing, and sound diversity with Tone Transfer. Finally, we implement a VST plugin for real-time live use and discuss possibilities for improvement.
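Because the abstract hinges on a training objective defined at the synthesis parameter level rather than on rendered audio, here is a minimal sketch of that idea, assuming ground-truth FM operator envelopes (e.g. produced by the synthesizer's own envelope generators) are available as regression targets. All class and variable names are illustrative, not the authors' implementation.

```python
# Minimal sketch: supervise predicted FM operator envelopes directly,
# so the loss is computed on synthesis parameters, not on rendered audio.
import torch
import torch.nn as nn

class EnvelopeModel(nn.Module):
    """Maps per-frame pitch/loudness features to FM operator envelope levels."""
    def __init__(self, n_operators: int = 4, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_operators)

    def forward(self, pitch: torch.Tensor, loudness: torch.Tensor) -> torch.Tensor:
        x = torch.stack([pitch, loudness], dim=-1)  # (batch, frames, 2)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h))          # envelope levels in [0, 1]

model = EnvelopeModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 8 clips, 100 frames each; in a real pipeline target_env would
# come from the FM synthesizer's envelope generators for matching notes.
pitch, loudness = torch.rand(8, 100), torch.rand(8, 100)
target_env = torch.rand(8, 100, 4)

pred_env = model(pitch, loudness)
loss = nn.functional.mse_loss(pred_env, target_env)  # parameter-level loss
loss.backward()
opt.step()
```

Because the loss never passes through rendered audio, note onsets and offsets are supervised directly in the envelope domain, which is the property the abstract credits for accurate note beginnings and endings.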
Related papers
- Creative Text-to-Audio Generation via Synthesizer Programming [1.1203110769488043]
We propose a text-to-audio generation method that leverages a virtual modular sound synthesizer with only 78 parameters.
Our method, CTAG, iteratively updates a synthesizer's parameters to produce high-quality audio renderings of text prompts.
arXiv Detail & Related papers (2024-06-01T04:08:31Z)
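As a loose illustration of the iterative loop this entry describes, the sketch below refines a fixed 78-dimensional parameter vector so that the rendered audio scores better against a text prompt. Both `render` and `text_audio_score` are hypothetical stand-ins (for the modular synthesizer and a learned text-audio similarity model such as CLAP), and the simple (1+1) evolution strategy is an assumption, not CTAG's actual optimizer.

```python
# Hedged sketch: iteratively update synthesizer parameters to better match
# a text prompt under some learned similarity score.
import numpy as np

N_PARAMS = 78  # parameter count quoted in the summary above

def render(params: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: map parameters to one second of audio."""
    t = np.linspace(0.0, 1.0, 16000)
    return params[1] * np.sin(2 * np.pi * (200 + 800 * params[0]) * t)

def text_audio_score(audio: np.ndarray, prompt: str) -> float:
    """Hypothetical stand-in for a learned text-audio similarity model."""
    return -float(np.abs(audio).mean())  # placeholder objective only

prompt = "a warm bell-like tone"
params = np.random.rand(N_PARAMS)
best = text_audio_score(render(params), prompt)
for _ in range(200):
    # (1+1) evolution strategy: perturb, keep the candidate if it improves.
    candidate = np.clip(params + 0.05 * np.random.randn(N_PARAMS), 0.0, 1.0)
    score = text_audio_score(render(candidate), prompt)
    if score > best:
        params, best = candidate, score
```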
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limited task coverage of prior audio-language pre-training by scaling up to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z)
- Multitrack Music Transcription with a Time-Frequency Perceiver [6.617487928813374]
Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously.
We propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription.
arXiv Detail & Related papers (2023-06-19T08:58:26Z)
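As a rough sketch of the idea named in this entry, a small latent array can cross-attend to a time-frequency representation, with per-instrument note logits read out from the latents. The dimensions, single attention layer, and readout below are illustrative assumptions, not the actual Perceiver TF design.

```python
# Rough sketch: latents cross-attend over a spectrogram, then a linear
# readout produces per-instrument pitch logits.
import torch
import torch.nn as nn

B, T, F = 2, 100, 128          # batch, time frames, frequency bins
N_LATENT, DIM = 16, 256        # latent array size and model width
N_INST, N_PITCH = 4, 88        # instruments and pitch classes

spec_proj = nn.Linear(F, DIM)                      # embed spectrogram frames
latents = nn.Parameter(torch.randn(N_LATENT, DIM))
xattn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
readout = nn.Linear(DIM, N_PITCH)

spec = torch.randn(B, T, F)                        # stand-in log-mel input
kv = spec_proj(spec)                               # (B, T, DIM)
q = latents.unsqueeze(0).repeat(B, 1, 1)           # (B, N_LATENT, DIM)
out, _ = xattn(q, kv, kv)                          # latents attend over T-F
logits = readout(out[:, :N_INST])                  # (B, N_INST, N_PITCH)
```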
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
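The recipe this entry summarizes, discretizing audio with a neural codec and then training an autoregressive language model over the resulting tokens, can be sketched minimally as below. The codec itself is not shown, and the vocabulary size and tiny recurrent model are illustrative placeholders, not AudioLM's actual token hierarchy or architecture.

```python
# Minimal sketch: next-token prediction over discrete audio codec tokens.
import torch
import torch.nn as nn

VOCAB = 1024  # e.g. the codebook size of a neural audio codec (assumption)

class AudioTokenLM(nn.Module):
    def __init__(self, vocab: int = VOCAB, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)  # logits for the next token at each position

model = AudioTokenLM()
tokens = torch.randint(0, VOCAB, (4, 512))   # stand-in codec token stream
logits = model(tokens[:, :-1])               # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
loss.backward()
```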
- Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer [6.29475963948119]
We propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks.
Our baseline differentiable synthesizer has no model parameters, yet it yields adequate quality synthesis.
An alternative differentiable approach considers extraction of the source spectrum directly, which can improve naturalness.
arXiv Detail & Related papers (2022-08-15T15:48:36Z)
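To illustrate the parameter-free differentiable synthesizer idea in this entry, the sketch below substitutes a bare harmonic model for WORLD: the synthesizer has no learnable parameters of its own, yet a spectral loss on its output back-propagates to upstream predictions of f0 and harmonic amplitudes. The shapes, the loss, and the one-frame-per-sample simplification are illustrative assumptions.

```python
# Hedged sketch: a parameter-free differentiable synthesizer lets a loss on
# the audio output train whatever network predicts its control inputs.
import torch

def harmonic_synth(f0: torch.Tensor, amps: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Differentiable, parameter-free sum of harmonics of f0.
    f0: (samples,), amps: (samples, n_harmonics); one frame per sample
    here for brevity."""
    n_harm = amps.shape[-1]
    phases = 2 * torch.pi * torch.cumsum(f0, dim=0) / sr     # (samples,)
    k = torch.arange(1, n_harm + 1)
    return (amps * torch.sin(phases[:, None] * k)).sum(-1)   # (samples,)

# These would come from an upstream network in a real pipeline.
f0 = torch.full((16000,), 220.0, requires_grad=True)
amps = torch.rand(16000, 8, requires_grad=True)
audio = harmonic_synth(f0, amps)

target = torch.randn(16000)  # stand-in reference audio
def spec(x: torch.Tensor) -> torch.Tensor:
    return torch.stft(x, 512, window=torch.hann_window(512),
                      return_complex=True).abs()

# Gradients reach f0 and amps through the synthesizer itself.
loss = (spec(audio) - spec(target)).pow(2).mean()
loss.backward()
```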
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)
- A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications [66.56753488329096]
Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications.
This article describes a technique for predicting lost packet content in real-time using a deep learning approach.
arXiv Detail & Related papers (2020-07-14T15:51:52Z)
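A minimal sketch of the prediction-based concealment this entry describes: a small network predicts the next packet of samples from recent history, and its prediction is substituted whenever a packet is lost. The packet size, history length, loss pattern, and untrained model are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: predict lost packets from the last few received ones.
import torch
import torch.nn as nn

PACKET = 128   # samples per packet (assumption)
HISTORY = 4    # packets of context (assumption)

predictor = nn.Sequential(
    nn.Linear(HISTORY * PACKET, 512), nn.ReLU(),
    nn.Linear(512, PACKET),  # predicted samples for the missing packet
)

def conceal(history: torch.Tensor) -> torch.Tensor:
    """history: (HISTORY, PACKET) past packets -> (PACKET,) prediction."""
    with torch.no_grad():
        return predictor(history.reshape(-1))

stream = []
lost = {3, 10}  # stand-in loss pattern: these packet indices never arrive
for i in range(16):
    packet = None if i in lost else torch.randn(PACKET)
    if packet is None and len(stream) >= HISTORY:
        packet = conceal(torch.stack(stream[-HISTORY:]))  # fill the gap
    stream.append(packet if packet is not None else torch.zeros(PACKET))
```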
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.