Related papers: Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish

URL: http://arxiv.org/abs/2311.12480v1
Date: Tue, 21 Nov 2023 09:44:33 GMT
Title: Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish
Authors: David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
Abstract summary: This paper studies how estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. Results comparable to the current state of the art were reached even when only a limited amount of data was available.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Different studies have shown the importance of visual cues throughout the speech perception process. In fact, the development of audiovisual approaches has led to advances in the field of speech technologies. However, although noticeable results have recently been achieved, visual speech recognition remains an open research problem. It is a task in which, by dispensing with the auditory sense, challenges such as visual ambiguities and the complexity of modeling silence must be faced. Nonetheless, some of these challenges can be alleviated when the problem is approached from a speaker-dependent perspective. Thus, this paper studies, using the Spanish LIP-RTVE database, how the estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. First, different adaptation strategies based on the fine-tuning technique were proposed. Then, a pre-trained CTC/Attention architecture was used as a baseline throughout our experiments. Our findings showed that a two-step fine-tuning process, where the VSR system is first adapted to the task domain, provided significant improvements when the speaker adaptation was addressed. Furthermore, results comparable to the current state of the art were reached even when only a limited amount of data was available.

Related papers

Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions [0.0]
This paper presents noticeable advances in automatic continuous lipreading for Spanish. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results. A rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system.
arXiv Detail & Related papers (2025-02-01T15:48:20Z)
LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild [0.0]
This paper presents a semi-automatically annotated audiovisual database to deal with unconstrained natural Spanish. Results for both speaker-dependent and speaker-independent scenarios are reported using Hidden Markov Models.
arXiv Detail & Related papers (2023-11-21T09:12:21Z)
Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data-intensive, deep neural network (DNN) based automatic speech recognition technologies. This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function. Our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Our idea is to learn to dereverberate speech from audio-visual observations. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks. Traditionally, these tasks have been tackled using signal processing and machine learning techniques. Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields. We discuss state-of-the-art methods as well as the remaining challenges of each subfield. We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends [10.176394550114411]
The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning. Recent reviews in speech have been conducted for ASR, SR, and SER; however, none of these has focused on representation learning from speech.
arXiv Detail & Related papers (2020-01-02T10:12:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.