The Concatenator: A Bayesian Approach To Real Time Concatenative Musaicing
- URL: http://arxiv.org/abs/2411.04366v1
- Date: Thu, 07 Nov 2024 01:52:46 GMT
- Title: The Concatenator: A Bayesian Approach To Real Time Concatenative Musaicing
- Authors: Christopher Tralie, Ben Cantil
- Abstract summary: We present ``The Concatenator,'' a real-time system for audio-guided concatenative synthesis.
We use a particle filter to infer the best corpus states in real-time.
Our system scales to corpora that are hours long, which is an important feature in the age of vast audio data collections.
- Abstract: We present ``The Concatenator,'' a real-time system for audio-guided concatenative synthesis. Similarly to Driedger et al.'s ``musaicing'' (or ``audio mosaicing'') technique, we concatenate a set number of windows within a corpus of audio to re-create the harmonic and percussive aspects of a target audio stream. Unlike Driedger's NMF-based technique, however, we instead take an explicitly Bayesian point of view, where corpus window indices are hidden states and the target audio stream is an observation. We use a particle filter to infer the best hidden corpus states in real-time. Our transition model includes a tunable parameter to control the time-continuity of corpus grains, and our observation model allows users to prioritize how quickly windows change to match the target. Because the computational complexity of the system is independent of the corpus size, our system scales to corpora that are hours long, which is an important feature in the age of vast audio data collections. Within The Concatenator module itself, composers can vary grain length, fit to target, and pitch shift in real time while reacting to the sounds they hear, enabling them to rapidly iterate ideas. To conclude our work, we evaluate our system with extensive quantitative tests of the effects of parameters, as well as a qualitative evaluation with artistic insights. Based on the quality of the results, we believe the real-time capability unlocks new avenues for musical expression and control, suitable for live performance and modular synthesis integration, which furthermore represents an essential breakthrough in concatenative synthesis technology.
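The abstract's formulation maps onto a standard bootstrap particle filter: corpus window indices are the hidden states, target audio features are the observations, the transition model favors contiguous grains via a continuity parameter, and per-frame cost depends on the number of particles rather than the corpus size. The sketch below is a minimal Python illustration of that idea, not the authors' implementation; all names and parameter values here (step, continuity, temperature, corpus_feats) are hypothetical, and feature extraction is assumed to be precomputed.

```python
import numpy as np

# Minimal bootstrap particle filter sketch for audio-guided concatenative
# synthesis, assuming feature vectors (e.g. spectrogram frames) have been
# precomputed for every corpus window and for each incoming target frame.
# Names and parameter values are illustrative assumptions, not the paper's
# actual implementation.

rng = np.random.default_rng(0)

def step(particles, weights, target_feat, corpus_feats,
         continuity=0.9, temperature=10.0):
    """Advance the filter by one target frame.

    particles    : (P,) int array of corpus window indices (hidden states)
    weights      : (P,) normalized particle weights
    target_feat  : (D,) observed feature vector of the current target frame
    corpus_feats : (N, D) feature matrix of all corpus windows
    continuity   : probability a grain continues to the next corpus window
                   (the tunable time-continuity parameter)
    temperature  : sharpness of the observation likelihood
    """
    p, n = len(particles), len(corpus_feats)

    # Transition model: mostly advance contiguously, occasionally jump
    # anywhere in the corpus.
    advance = rng.random(p) < continuity
    particles = np.where(advance, (particles + 1) % n, rng.integers(0, n, p))

    # Observation model: likelihood decays with feature distance to the
    # target, so corpus windows that sound like the target gain weight.
    dists = np.linalg.norm(corpus_feats[particles] - target_feat, axis=1)
    weights = weights * np.exp(-temperature * dists)
    weights = weights / weights.sum()

    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * p:
        idx = rng.choice(p, p, p=weights)
        particles, weights = particles[idx], np.full(p, 1.0 / p)

    best = particles[np.argmax(weights)]  # corpus window to play next
    return particles, weights, best

# Hypothetical usage with random features standing in for real audio.
corpus_feats = rng.standard_normal((1000, 40))
particles = rng.integers(0, 1000, 64)
weights = np.full(64, 1.0 / 64)
for target_feat in rng.standard_normal((8, 40)):
    particles, weights, best = step(particles, weights, target_feat, corpus_feats)
```

Note that each step touches only the P sampled particles (the indexing corpus_feats[particles]), never all N corpus windows, which is consistent with the abstract's claim that per-frame cost is independent of corpus size.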
Related papers
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in real time.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
arXiv Detail & Related papers (2022-06-11T03:26:15Z) - Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis [67.96138567288197]
This paper proposes an end-to-end singing voice synthesis framework, named Singing-Tacotron.
The main difference between the proposed framework and Tacotron is that the speech duration can be controlled by the musical score's duration information.
arXiv Detail & Related papers (2022-02-16T07:35:17Z) - Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z) - Neural Waveshaping Synthesis [0.0]
We present a novel, lightweight, fully causal approach to neural audio synthesis.
The Neural Waveshaping Unit (NEWT) operates directly in the waveform domain.
It produces complex timbral evolutions by simple affine transformations of its input and output signals.
arXiv Detail & Related papers (2021-07-11T13:50:59Z) - A Modulation Front-End for Music Audio Tagging [0.0]
Modulation filter bank representations have the potential to facilitate the extraction of perceptually salient features.
We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block.
We evaluate the performance of our model against the state-of-the-art of music tagging on the MagnaTagATune dataset.
arXiv Detail & Related papers (2021-05-25T11:05:24Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides when to read more input and when to begin synthesising audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z) - AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos with Deep Learning [5.33024001730262]
We present AutoFoley, a fully-automated deep learning tool that can be used to synthesize a representative audio track for videos.
AutoFoley can be used in applications where no corresponding audio file is associated with the video, or where critical scenarios need to be identified.
Our experiments show that the synthesized sounds are realistically portrayed with accurate temporal synchronization of the associated visual inputs.
arXiv Detail & Related papers (2020-02-21T09:08:28Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)