Analysis of Visual Features for Continuous Lipreading in Spanish
- URL: http://arxiv.org/abs/2311.12468v1
- Date: Tue, 21 Nov 2023 09:28:00 GMT
- Title: Analysis of Visual Features for Continuous Lipreading in Spanish
- Authors: David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
- Abstract summary: Lipreading is a complex task whose objective is to interpret speech when audio is not available.
We propose an analysis of different visual speech features with the intention of identifying which of them best captures the nature of lip movements for natural Spanish.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: During a conversation, our brain combines information obtained from
multiple senses in order to improve our ability to understand the message we are
perceiving. Different studies have shown the importance of presenting visual
information in these situations. Nevertheless, lipreading is a complex task whose
objective is to interpret speech when audio is not available. Since it does without
a sense as crucial as hearing, the challenge posed by this absence must be kept in
mind. In this paper, we propose an analysis of different visual speech features
with the intention of identifying which of them best captures the nature of lip
movements for natural Spanish and, in this way, of addressing the automatic visual
speech recognition task. In order to evaluate our system, we present an audiovisual
corpus compiled from a subset of the RTVE database, which has been used in the
Albayzín evaluations. We employ a traditional system based on Hidden Markov Models
with Gaussian Mixture Models. Results show that, although the task is difficult,
under restricted conditions we obtain recognition results which indicate that using
eigenlips in combination with deep features is the best visual approach.
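The best-performing configuration reported above combines eigenlips (a PCA projection of the lip region of interest) with deep features before the GMM-HMM back-end. The following is a minimal sketch, not the authors' code, of how such eigenlip features could be computed and concatenated frame-by-frame with deep features; the 32-component PCA size, the array shapes, and the function names are illustrative assumptions.

```python
# Sketch (assumed, not the paper's implementation): eigenlip features via PCA
# over flattened grayscale lip ROIs, fused with deep features by concatenation.
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenlips(train_rois: np.ndarray, n_components: int = 32) -> PCA:
    """train_rois: (num_frames, height, width) lip crops from training data."""
    flat = train_rois.reshape(train_rois.shape[0], -1).astype(np.float32)
    # The PCA basis vectors over lip images are the "eigenlips".
    return PCA(n_components=n_components).fit(flat)

def eigenlip_features(pca: PCA, rois: np.ndarray) -> np.ndarray:
    """Project each frame's lip ROI onto the eigenlip basis."""
    flat = rois.reshape(rois.shape[0], -1).astype(np.float32)
    return pca.transform(flat)  # (num_frames, n_components)

def fuse(eigenlips: np.ndarray, deep_feats: np.ndarray) -> np.ndarray:
    """Frame-wise concatenation with deep features (e.g. CNN embeddings)."""
    assert eigenlips.shape[0] == deep_feats.shape[0]
    return np.concatenate([eigenlips, deep_feats], axis=1)

# Toy usage: 500 training frames and a 100-frame utterance of 32x48 lip crops,
# plus hypothetical 256-dimensional deep features per frame.
train_rois = np.random.rand(500, 32, 48)
utt_rois = np.random.rand(100, 32, 48)
deep_feats = np.random.rand(100, 256)

pca = fit_eigenlips(train_rois)
features = fuse(eigenlip_features(pca, utt_rois), deep_feats)
print(features.shape)  # (100, 288)
```

In the paper's pipeline, such fused feature vectors would then be modeled by a Hidden Markov Model with Gaussian Mixture Model emissions; the sketch stops at feature extraction.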
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection [51.66174565170112]
We introduce a novel approach to utilize the strengths of large language models in understanding contextual appearance variations.
We propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection.
arXiv Detail & Related papers (2023-11-02T06:38:19Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism.
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Language identification as improvement for lip-based biometric visual systems [13.205817167773443]
We present a preliminary study in which we use linguistic information as a soft biometric trait to enhance the performance of a visual (auditory-free) identification system based on lip movement.
We report a significant improvement in the identification performance of the proposed visual system as a result of the integration of these data.
arXiv Detail & Related papers (2023-02-27T15:44:24Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models [0.03499870393443267]
This work builds on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems.
We propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training.
arXiv Detail & Related papers (2022-06-05T15:47:54Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)