Content Adaptive Front End For Audio Signal Processing
- URL: http://arxiv.org/abs/2303.10446v2
- Date: Sat, 29 Apr 2023 14:54:47 GMT
- Title: Content Adaptive Front End For Audio Signal Processing
- Authors: Prateek Verma and Chris Chafe
- Abstract summary: We propose a learnable content adaptive front end for audio signal processing.
We pass each audio signal through a bank of convolutional filters, each giving a fixed-dimensional vector.
- Score: 2.8935588665357077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a learnable content adaptive front end for audio signal
processing. Before the modern advent of deep learning, we used fixed,
non-learnable front ends such as the spectrogram or mel-spectrogram, with or
without neural architectures. With convolutional architectures supporting
various applications such as ASR and acoustic scene understanding, a shift to
learnable front ends occurred in which both the type of basis functions and
their weights were learned from scratch and optimized for the particular task of
interest. With the shift to transformer-based architectures with no
convolutional blocks present, a linear layer projects small waveform patches
onto a small latent dimension before feeding them to a transformer
architecture. In this work, we propose a way of computing a content-adaptive
learnable time-frequency representation. We pass each audio signal through a
bank of convolutional filters, each giving a fixed-dimensional vector. It is
akin to learning a bank of finite impulse-response filterbanks and passing the
input signal through the optimum filter bank depending on the content of the
input signal. A content-adaptive learnable time-frequency representation may be
more broadly applicable, beyond the experiments in this paper.
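As a rough illustration of the abstract's idea, the sketch below (PyTorch assumed) learns several convolutional filterbanks together with a small scoring network that softly selects among them based on the content of the input waveform; the number of banks, the soft selection, and all layer sizes are illustrative assumptions rather than the authors' exact architecture.

```python
# Minimal sketch of a content-adaptive front end, based only on the abstract
# above. The soft routing and all sizes are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContentAdaptiveFrontEnd(nn.Module):
    def __init__(self, num_banks=4, num_filters=80, kernel_size=401, hop=160):
        super().__init__()
        # A bank of learnable FIR-like filterbanks (1-D convolutions).
        self.banks = nn.ModuleList([
            nn.Conv1d(1, num_filters, kernel_size, stride=hop, padding=kernel_size // 2)
            for _ in range(num_banks)
        ])
        # A small scoring network that inspects the raw waveform and decides
        # how much each filterbank should contribute (soft selection).
        self.scorer = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=401, stride=160),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(16, num_banks),
        )

    def forward(self, wav):                                    # wav: (batch, samples)
        x = wav.unsqueeze(1)                                   # -> (batch, 1, samples)
        weights = F.softmax(self.scorer(x), dim=-1)            # (batch, num_banks)
        outs = torch.stack([bank(x) for bank in self.banks], dim=1)
        # Weight each filterbank's output by its content score and combine.
        tf = (weights[:, :, None, None] * outs).sum(dim=1)     # (batch, filters, frames)
        return torch.log1p(tf.abs())                           # log-compressed representation


rep = ContentAdaptiveFrontEnd()(torch.randn(1, 16000))         # one second at 16 kHz
```

A hard selection (picking the single best-scoring filterbank per input) is also consistent with the abstract; the soft mixture above simply keeps the sketch differentiable end to end.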
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - Towards Signal Processing In Large Language Models [46.76681147411957]
This paper introduces the idea of applying signal processing inside a Large Language Model (LLM).
We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations.
We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance.
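As a hedged illustration of the parallel drawn above, the snippet below (PyTorch assumed) defines a transform layer whose weights start as a real DFT basis but remain trainable, so it can be applied to windows of intermediate model activations; the window length and the cos/sin initialization are assumptions for illustration only.

```python
# Learnable Fourier-Transform-like layer: initialized as a real DFT basis,
# free to drift from it during training. Sizes are illustrative assumptions.
import math
import torch
import torch.nn as nn


class LearnableFourierLayer(nn.Module):
    def __init__(self, window=64):
        super().__init__()
        n = torch.arange(window).float()
        k = n.unsqueeze(1)
        basis = torch.cat([torch.cos(2 * math.pi * k * n / window),
                           torch.sin(2 * math.pi * k * n / window)], dim=0)
        self.weight = nn.Parameter(basis)          # (2 * window, window)

    def forward(self, x):
        # x: (batch, seq, dim) with seq == window; transform along the sequence axis.
        return torch.einsum("fs,bsd->bfd", self.weight, x)


coeffs = LearnableFourierLayer(window=64)(torch.randn(2, 64, 512))   # (2, 128, 512)
```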
arXiv Detail & Related papers (2024-06-10T13:51:52Z) - Neural Architectures Learning Fourier Transforms, Signal Processing and
Much More.... [1.2328446298523066]
We show how one can learn kernels from scratch for audio signal processing applications.
We find that the neural architecture not only learns sinusoidal kernel shapes but discovers all kinds of incredible signal-processing properties.
arXiv Detail & Related papers (2023-08-20T23:30:27Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
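The entry above mentions a single multiscale spectrogram adversary. As a hedged illustration of the "multiscale spectrogram" part only (the discriminator itself is omitted), the helper below computes magnitude spectrograms at several FFT sizes; the specific sizes are assumptions.

```python
# Magnitude spectrograms at multiple resolutions, e.g. as inputs to a
# spectrogram-domain discriminator. FFT sizes are illustrative assumptions.
import torch


def multiscale_spectrograms(wav, fft_sizes=(256, 512, 1024, 2048)):
    """wav: (batch, samples) -> list of (batch, freq, frames) magnitude spectrograms."""
    specs = []
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=wav.device)
        spec = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                          window=window, return_complex=True)
        specs.append(spec.abs())    # each scale trades time vs. frequency resolution
    return specs


scales = multiscale_spectrograms(torch.randn(1, 16000))
```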
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Interpretable Acoustic Representation Learning on Breathing and Speech
Signals for COVID-19 Detection [37.01066509527848]
We describe an approach for representation learning of audio signals for the task of COVID-19 detection.
The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions.
The filtered outputs are pooled, log-compressed and used in a self-attention based relevance weighting mechanism.
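The pipeline described above can be sketched as follows (PyTorch assumed): 1-D kernels parameterized as cosine-modulated Gaussians, followed by pooling and log compression; the self-attention relevance weighting stage is omitted, and all hyperparameters are illustrative assumptions rather than the authors' settings.

```python
# 1-D filters parameterized as cosine-modulated Gaussians with learnable
# centre frequencies and widths. Hyperparameters are illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosGaussFilterbank(nn.Module):
    def __init__(self, num_filters=64, kernel_size=401, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Learnable centre frequencies (Hz) and Gaussian widths (samples).
        self.center_hz = nn.Parameter(torch.linspace(60.0, sample_rate / 2 - 60.0, num_filters))
        self.width = nn.Parameter(torch.full((num_filters,), kernel_size / 6.0))

    def forward(self, wav):                            # wav: (batch, samples)
        t = torch.arange(self.kernel_size, device=wav.device).float() - self.kernel_size // 2
        gauss = torch.exp(-0.5 * (t / self.width.unsqueeze(1)) ** 2)
        carrier = torch.cos(2 * math.pi * self.center_hz.unsqueeze(1) * t / self.sample_rate)
        kernels = (gauss * carrier).unsqueeze(1)       # (filters, 1, kernel)
        y = F.conv1d(wav.unsqueeze(1), kernels, stride=160, padding=self.kernel_size // 2)
        # Pool and log-compress, mirroring the pipeline summarized above.
        return torch.log1p(F.avg_pool1d(y.abs(), 4))


features = CosGaussFilterbank()(torch.randn(1, 16000))   # (1, 64, 25)
```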
arXiv Detail & Related papers (2022-06-27T15:20:51Z) - A Modulation Front-End for Music Audio Tagging [0.0]
Modulation filter bank representations have the potential to facilitate the extraction of perceptually salient features.
We explore end-to-end learned front-ends for audio representation learning, ModNet and SincModNet, that incorporate a temporal modulation processing block.
We evaluate the performance of our model against the state-of-the-art of music tagging on the MagnaTagATune dataset.
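One common reading of a "temporal modulation processing block" is sketched below under that assumption (PyTorch assumed; this is not the ModNet/SincModNet code): a first filterbank splits the waveform into sub-bands, crude envelopes are taken, and a second, depthwise filterbank along time captures modulation rates.

```python
# Illustrative modulation front end: analysis filterbank -> rectified
# envelopes -> per-band modulation filters. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModulationFrontEnd(nn.Module):
    def __init__(self, sub_bands=40, mod_filters=8):
        super().__init__()
        self.analysis = nn.Conv1d(1, sub_bands, kernel_size=401, stride=160, padding=200)
        # Depthwise convolution along time: a separate set of modulation
        # filters for each sub-band.
        self.modulation = nn.Conv1d(sub_bands, sub_bands * mod_filters,
                                    kernel_size=31, padding=15, groups=sub_bands)

    def forward(self, wav):                        # wav: (batch, samples)
        sub = self.analysis(wav.unsqueeze(1))      # (batch, sub_bands, frames)
        envelope = F.relu(sub)                     # crude envelope via rectification
        return self.modulation(envelope)           # (batch, sub_bands * mod_filters, frames)


out = ModulationFrontEnd()(torch.randn(1, 16000))  # (1, 320, 100)
```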
arXiv Detail & Related papers (2021-05-25T11:05:24Z) - Audio Transformers: Transformer Architectures For Large Scale Audio
Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models, producing state-of-the-art results.
We further improve the performance of Transformer architectures by using techniques such as pooling inspired by convolutional networks.
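Both the abstract above and this entry describe feeding raw audio to a Transformer by projecting small waveform patches to a latent dimension, with convnet-style pooling over time. A hedged sketch of that scheme follows (PyTorch assumed; all sizes are illustrative, and this is not the authors' model).

```python
# Waveform patches -> linear projection -> Transformer encoder -> pooling.
# Patch size, model width, and depth are illustrative assumptions.
import torch
import torch.nn as nn


class PatchAudioTransformer(nn.Module):
    def __init__(self, patch=400, dim=128, layers=4, heads=4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch, dim)          # small waveform patch -> latent vector
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.pool = nn.AvgPool1d(2)                # convnet-inspired pooling over time

    def forward(self, wav):                        # wav: (batch, samples)
        b, n = wav.shape
        patches = wav[:, : n - n % self.patch].reshape(b, -1, self.patch)
        tokens = self.encoder(self.proj(patches))  # (batch, num_patches, dim)
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)


emb = PatchAudioTransformer()(torch.randn(1, 16000))   # (1, 20, 128)
```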
arXiv Detail & Related papers (2021-05-01T19:38:30Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z) - Ensemble Wrapper Subsampling for Deep Modulation Classification [70.91089216571035]
Subsampling of received wireless signals is important for relaxing hardware requirements as well as the computational cost of signal processing algorithms.
We propose a subsampling technique to facilitate the use of deep learning for automatic modulation classification in wireless communication systems.
arXiv Detail & Related papers (2020-05-10T06:11:13Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)