Deep Representation Learning in Speech Processing: Challenges, Recent
Advances, and Future Trends
- URL: http://arxiv.org/abs/2001.00378v2
- Date: Fri, 24 Sep 2021 05:09:30 GMT
- Title: Deep Representation Learning in Speech Processing: Challenges, Recent
Advances, and Future Trends
- Authors: Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir,
and Björn W. Schuller
- Abstract summary: The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning.
Recent reviews in speech have been conducted for ASR, SR, and SER; however, none of these has focused on representation learning from speech.
- Score: 10.176394550114411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on speech processing has traditionally treated the design
of hand-engineered acoustic features (feature engineering) as a problem
distinct from the design of efficient machine learning (ML) models for
prediction and classification. This approach has two main drawbacks: first,
manual feature engineering is cumbersome and requires human knowledge; second,
the designed features might not be the best for the objective at hand. This has
motivated a recent trend in the speech community towards representation
learning techniques, which can automatically learn an intermediate
representation of the input signal that better suits the task at hand and hence
leads to improved performance. The significance of representation learning has
increased with advances in deep learning (DL), where the representations are
more useful and less dependent on human knowledge, making them well suited to
tasks such as classification and prediction. The main contribution of this
paper is to present an up-to-date and comprehensive survey of different
techniques of speech representation learning by bringing together the scattered
research across three distinct research areas: Automatic Speech Recognition
(ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent
reviews in speech have been conducted for ASR, SR, and SER; however, none of
these has focused on representation learning from speech, a gap that our survey
aims to bridge.
Related papers
- Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective function to incorporate language labels along with the self-supervised loss function.
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal that is relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Survey on Automated Short Answer Grading with Deep Learning: from Word
Embeddings to Transformers [5.968260239320591]
Automated short answer grading (ASAG) has gained attention in education as a means to scale educational tasks to the growing number of students.
Recent progress in Natural Language Processing and Machine Learning has largely influenced the field of ASAG.
arXiv Detail & Related papers (2022-03-11T13:47:08Z)
- Visualizing Automatic Speech Recognition -- Means for a Better
Understanding? [0.1868368163807795]
We show how attribution methods, which we import from image recognition and suitably adapt to handle audio data, can help to clarify the workings of ASR.
Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output.
arXiv Detail & Related papers (2022-02-01T13:35:08Z)
- An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues through an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Multi-Task Learning with Auxiliary Speaker Identification for
Conversational Emotion Recognition [32.439818455554885]
We exploit speaker identification (SI) as an auxiliary task to enhance the utterance representation in conversations.
By this method, we can learn better speaker-aware contextual representations from the additional SI corpus.
Experiments on two benchmark datasets demonstrate that the proposed architecture is highly effective for CER.
arXiv Detail & Related papers (2020-03-03T12:25:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.