Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
- URL: http://arxiv.org/abs/2009.01822v1
- Date: Thu, 3 Sep 2020 17:40:27 GMT
- Title: Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
- Authors: Amirhossein Hajavi, Ali Etemad
- Abstract summary: We propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
- Score: 25.71206255965502
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning techniques have considerably improved speech processing in
recent years. Speech representations extracted by deep learning models are
being used in a wide range of tasks such as speech recognition, speaker
recognition, and speech emotion recognition. Attention models play an important
role in improving deep learning models. However, current attention mechanisms
are unable to attend to fine-grained information items. In this paper, we
propose the novel Fine-grained Early Frequency Attention (FEFA) for speech
signals. This model is capable of focusing on information items as small as
frequency bins. We evaluate the proposed model on two popular tasks of speaker
recognition and speech emotion recognition. Two widely used public datasets,
VoxCeleb and IEMOCAP, are used for our experiments. The model is implemented on
top of several prominent deep models, used as backbone networks, to evaluate its
impact on performance compared to the original networks and other related work.
Our experiments show that by adding FEFA to different CNN architectures,
performance is consistently improved by substantial margins, even setting a new
state-of-the-art for the speaker recognition task. We also tested our model
against different levels of added noise, showing improved robustness and
reduced sensitivity compared to the backbone networks.
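
To make the idea in the abstract concrete, the sketch below shows one way an early frequency-bin attention module could be wired in front of a CNN backbone operating on spectrograms: a per-bin weight is learned from the input and used to rescale each frequency bin before the backbone sees it. This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation; the class names (FrequencyBinAttention, EarlyAttentionCNN) and the toy backbone are illustrative placeholders.

```python
# Minimal sketch (assumption, not the authors' code): early attention over
# frequency bins applied to a spectrogram before a small CNN backbone.
import torch
import torch.nn as nn


class FrequencyBinAttention(nn.Module):
    """Scores each frequency bin and rescales the spectrogram accordingly."""

    def __init__(self, num_bins: int):
        super().__init__()
        # One weight in [0, 1] per frequency bin, computed from the
        # time-averaged energy profile of the input spectrogram.
        self.scorer = nn.Sequential(
            nn.Linear(num_bins, num_bins),
            nn.Sigmoid(),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, num_bins, num_frames)
        profile = spec.mean(dim=-1)          # (batch, num_bins)
        weights = self.scorer(profile)       # (batch, num_bins)
        return spec * weights.unsqueeze(-1)  # reweight each frequency bin


class EarlyAttentionCNN(nn.Module):
    """Applies frequency-bin attention before a (toy) CNN backbone."""

    def __init__(self, num_bins: int = 64, num_classes: int = 10):
        super().__init__()
        self.attention = FrequencyBinAttention(num_bins)
        self.backbone = nn.Sequential(       # stand-in for a prominent CNN backbone
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        attended = self.attention(spec)              # attend before the backbone
        return self.backbone(attended.unsqueeze(1))  # add a channel dimension


if __name__ == "__main__":
    model = EarlyAttentionCNN(num_bins=64, num_classes=10)
    dummy_spec = torch.randn(2, 64, 300)  # (batch, frequency bins, frames)
    print(model(dummy_spec).shape)        # torch.Size([2, 10])
```

The toy backbone only keeps the example self-contained; the point of the sketch is the ordering, i.e. attention over individual frequency bins is applied to the input representation before any backbone processing.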
Related papers
- A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement [16.900731393703648] (arXiv, 2024-03-03)
  Self-supervised learned (SSL) models have been found to be very effective for certain speech tasks.
  In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions.
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038] (arXiv, 2023-09-27)
  Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
  Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
  Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
- Speaker Recognition in Realistic Scenario Using Multimodal Data [4.373374186532439] (arXiv, 2023-02-25)
  We propose a two-branch network to learn joint representations of faces and voices in a multimodal system.
  We evaluate our proposed framework on a large-scale audio-visual dataset named VoxCeleb1.
- A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement [20.329872147913584] (arXiv, 2022-06-22)
  We compare different methods of incorporating phonetic information in a speech enhancement model.
  We observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance.
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483] (arXiv, 2022-05-21)
  Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
  Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
  This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248] (arXiv, 2021-12-16)
  We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
  ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
  Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158] (arXiv, 2021-10-18)
  We propose two neural networks for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
  We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
  Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
- Learning Audio-Visual Dereverberation [87.52880019747435] (arXiv, 2021-06-14)
  Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
  Our idea is to learn to dereverberate speech from audio-visual observations.
  We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and the visual scene.
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689] (arXiv, 2020-08-21)
  Speech enhancement and speech separation are two related tasks.
  Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
  More recently, deep learning has been exploited to achieve strong performance.
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019] (arXiv, 2020-05-18)
  We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
  We show that Audio ALBERT is capable of achieving competitive performance with those huge models on the downstream tasks.
  In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.