Speaker-Adapted End-to-End Visual Speech Recognition for Continuous
Spanish
- URL: http://arxiv.org/abs/2311.12480v1
- Date: Tue, 21 Nov 2023 09:44:33 GMT
- Title: Speaker-Adapted End-to-End Visual Speech Recognition for Continuous
Spanish
- Authors: David Gimeno-G\'omez, Carlos-D. Mart\'inez-Hinarejos
- Abstract summary: This paper studies how estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition.
Results comparable to the current state of the art were reached even when only a limited amount of data was available.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Different studies have shown the importance of visual cues throughout the
speech perception process. In fact, the development of audiovisual approaches
has led to advances in the field of speech technologies. However, although
noticeable results have recently been achieved, visual speech recognition
remains an open research problem. It is a task in which, by dispensing with the
auditory sense, challenges such as visual ambiguities and the complexity of
modeling silence must be faced. Nonetheless, some of these challenges can be
alleviated when the problem is approached from a speaker-dependent perspective.
Thus, this paper studies, using the Spanish LIP-RTVE database, how the
estimation of specialized end-to-end systems for a specific person could affect
the quality of speech recognition. First, different adaptation strategies based
on the fine-tuning technique were proposed. Then, a pre-trained CTC/Attention
architecture was used as a baseline throughout our experiments. Our findings
showed that a two-step fine-tuning process, where the VSR system is first
adapted to the task domain, provided significant improvements when the speaker
adaptation was addressed. Furthermore, results comparable to the current state
of the art were reached even when only a limited amount of data was available.
Related papers
- LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild [0.0]
This paper presents a semi-automatically annotated audiovisual database to deal with unconstrained natural Spanish.
Results for both speaker-dependent and speaker-independent scenarios are reported using Hidden Markov Models.
arXiv Detail & Related papers (2023-11-21T09:12:21Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z) - Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
arXiv Detail & Related papers (2021-11-18T11:47:37Z) - Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z) - Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z) - Deep Representation Learning in Speech Processing: Challenges, Recent
Advances, and Future Trends [10.176394550114411]
The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning.
Recent reviews in speech have been conducted for ASR, SR, and SER, however, none of these has focused on the representation learning from speech.
arXiv Detail & Related papers (2020-01-02T10:12:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.