Point Cloud Audio Processing
- URL: http://arxiv.org/abs/2105.02469v1
- Date: Thu, 6 May 2021 07:04:59 GMT
- Title: Point Cloud Audio Processing
- Authors: Krishna Subramani, Paris Smaragdis
- Abstract summary: We introduce a novel way of processing audio signals by treating them as a collection of points in feature space.
We observe that these methods result in smaller models, and allow us to significantly subsample the input representation with minimal effect on the performance of a trained model.
- Score: 18.88427891844357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most audio processing pipelines involve transformations that act on
fixed-dimensional input representations of audio. For example, when using the
Short Time Fourier Transform (STFT) the DFT size specifies a fixed dimension
for the input representation. As a consequence, most audio machine learning
models are designed to process fixed-size vector inputs which often prohibits
the repurposing of learned models on audio with different sampling rates or
alternative representations. We note, however, that the intrinsic spectral
information in the audio signal is invariant to the choice of the input
representation or the sampling rate. Motivated by this, we introduce a novel
way of processing audio signals by treating them as a collection of points in
feature space, and we use point cloud machine learning models that give us
invariance to the choice of representation parameters, such as DFT size or the
sampling rate. Additionally, we observe that these methods result in smaller
models, and allow us to significantly subsample the input representation with
minimal effect on the performance of a trained model.
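To make the core idea concrete, below is a minimal NumPy sketch of treating a spectral frame as a point cloud and encoding it with a PointNet-style network. This is not the authors' implementation: the (normalized frequency, log-magnitude) feature choice, the function names, and all layer sizes are illustrative assumptions. It only demonstrates why the embedding size is independent of the DFT size and of how many points are kept.

```python
# Minimal sketch (not the authors' code) of the paper's core idea:
# represent an STFT frame as a set of (frequency, magnitude) points and
# encode it with a permutation-invariant, PointNet-style network whose
# output size does not depend on the DFT size or the number of points.
import numpy as np

def frame_to_points(frame_dft, sample_rate):
    """Turn one DFT frame into a point cloud of (normalized freq, log-magnitude)."""
    mags = np.abs(frame_dft)
    freqs = np.fft.rfftfreq(2 * (len(frame_dft) - 1), d=1.0 / sample_rate)
    # Normalizing frequency by Nyquist makes coordinates comparable across
    # sampling rates (an assumption, not necessarily the paper's exact scheme).
    return np.stack([freqs / (sample_rate / 2), np.log1p(mags)], axis=1)  # (N, 2)

def pointnet_encode(points, w1, w2):
    """Shared per-point MLP followed by max pooling: invariant to point
    order and to the number of points N."""
    h = np.maximum(points @ w1, 0.0)   # (N, hidden), weights shared per point
    h = np.maximum(h @ w2, 0.0)        # (N, embed)
    return h.max(axis=0)               # (embed,) -- independent of N

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(2, 64)), rng.normal(size=(64, 128))

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)        # 1 s of a 440 Hz tone

for n_fft in (512, 1024, 4096):        # different DFT sizes...
    frame = np.fft.rfft(x[:n_fft] * np.hanning(n_fft))
    pts = frame_to_points(frame, sr)
    keep = rng.random(len(pts)) < 0.25 # ...and aggressive point subsampling
    z_full = pointnet_encode(pts, w1, w2)
    z_sub = pointnet_encode(pts[keep], w1, w2)
    print(n_fft, z_full.shape, z_sub.shape)  # embedding size never changes
```

The max pooling over points is what provides invariance to both point order and point count, which is why the same trained weights can, in principle, be reused across DFT sizes, sampling rates, and heavily subsampled inputs.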
Related papers
- Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in continuous space $\mathbb{R}^d$ and autoregressively generating these sequences.
The high-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
arXiv Detail & Related papers (2024-06-08T18:57:13Z) - DPATD: Dual-Phase Audio Transformer for Denoising [25.097894984130733]
We propose a dual-phase audio transformer for denoising (DPATD), a novel model that organizes transformer layers in a deep structure to learn clean audio sequences.
Our memory-compressed explainable attention is efficient and converges faster than the widely used self-attention module.
arXiv Detail & Related papers (2023-10-30T14:44:59Z) - TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - Modulation Extraction for LFO-driven Audio Effects [5.740770499256802]
We propose a framework capable of extracting arbitrary LFO signals from processed audio across multiple digital audio effects, parameter settings, and instrument configurations.
We show how coupling the extraction model with a simple processing network enables training of end-to-end black-box models of unseen analog or digital LFO-driven audio effects.
We make our code available and provide the trained audio effect models in a real-time VST plugin.
arXiv Detail & Related papers (2023-05-22T17:33:07Z) - Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z) - Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
The introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z) - Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.