Complementing Handcrafted Features with Raw Waveform Using a
Light-weight Auxiliary Model
- URL: http://arxiv.org/abs/2109.02773v1
- Date: Mon, 6 Sep 2021 23:32:10 GMT
- Title: Complementing Handcrafted Features with Raw Waveform Using a
Light-weight Auxiliary Model
- Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria Powell, Douglas C.
Schmidt
- Abstract summary: An emerging trend in audio processing is capturing low-level speech representations from raw waveforms.
This paper proposes an Auxiliary Rawnet model to complement handcrafted features with features learned from raw waveforms.
A key benefit of the approach is that it can improve accuracy at a relatively low computational cost.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An emerging trend in audio processing is capturing low-level speech
representations from raw waveforms. These representations have shown promising
results on a variety of tasks, such as speech recognition and speech
separation. Compared to handcrafted features, learning speech features via
backpropagation theoretically gives the model greater flexibility in how it
represents data for different tasks. However, empirical results
show that, in some tasks, such as voice spoof detection, handcrafted features
are more competitive than learned features. Instead of evaluating handcrafted
features and raw waveforms independently, this paper proposes an Auxiliary
Rawnet model to complement handcrafted features with features learned from raw
waveforms. A key benefit of the approach is that it can improve accuracy at a
relatively low computational cost. The proposed Auxiliary Rawnet model is
tested using the ASVspoof 2019 dataset and the results from this dataset
indicate that a light-weight waveform encoder can potentially boost the
performance of handcrafted-features-based encoders in exchange for a small
amount of additional computational work.
Related papers
- Noise-Resilient Unsupervised Graph Representation Learning via Multi-Hop Feature Quality Estimation [53.91958614666386]
Unsupervised graph representation learning (UGRL) based on graph neural networks (GNNs)
We propose a novel UGRL method based on Multi-hop feature Quality Estimation (MQE)
arXiv Detail & Related papers (2024-07-29T12:24:28Z)
- Toward end-to-end interpretable convolutional neural networks for waveform signals [0.7499722271664147]
This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models.
By benchmarking experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel spectrogram features by up to seven percent.
arXiv Detail & Related papers (2024-05-03T02:24:27Z)
- Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement [19.632358491434697]
Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning.
In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task.
Our proposed method enables significant improvements in speech quality compared to baselines when combined with various types of pre-trained speech models.
arXiv Detail & Related papers (2023-06-14T10:03:33Z)
- Adaptive re-calibration of channel-wise features for Adversarial Audio Classification [0.0]
We propose a recalibration of features using attention feature fusion for synthetic speech detection.
We compare its performance against different detection methods including End2End models and Resnet-based models.
We also demonstrate that combining linear frequency cepstral coefficients (LFCC) and mel frequency cepstral coefficients (MFCC) using the attentional feature fusion technique creates better input feature representations.
arXiv Detail & Related papers (2022-10-21T04:21:56Z)
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely-used dataset, Snips, and large scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on real-world (noisy) corpus but also enhances the robustness, that is, it produces high-quality results under a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)