Related papers: Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions

Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions

URL: http://arxiv.org/abs/2502.00464v2
Date: Mon, 17 Feb 2025 14:44:05 GMT
Title: Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions
Authors: David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos,
Abstract summary: This paper presents noticeable advances in automatic continuous lipreading for Spanish.<n> Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results.<n>A rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Visual speech recognition remains an open research problem where different challenges must be considered by dispensing with the auditory sense, such as visual ambiguities, the inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, recent remarkable results have been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. Besides, multiple languages apart from English are nowadays a focus of interest. This paper presents noticeable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve the best performance obtained to date for both databases. In addition, a thorough ablation study is carried out, where it is studied how the different components that form the architecture influence the quality of speech recognition. Then, a rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. Code and trained models are available at https://github.com/david-gimeno/evaluating-end2end-spanish-lipreading.

Related papers

Untangling the Influence of Typology, Data and Model Architecture on Ranking Transfer Languages for Cross-Lingual POS Tagging [1.743739675178893]
Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection.
arXiv Detail & Related papers (2025-03-25T18:05:40Z)
Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish [0.0]
This paper studies how estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. Results comparable to the current state of the art were reached even when only a limited amount of data was available.
arXiv Detail & Related papers (2023-11-21T09:44:33Z)
LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild [0.0]
This paper presents a semi-automatically annotated audiovisual database to deal with unconstrained natural Spanish. Results for both speaker-dependent and speaker-independent scenarios are reported using Hidden Markov Models.
arXiv Detail & Related papers (2023-11-21T09:12:21Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
Language identification as improvement for lip-based biometric visual systems [13.205817167773443]
We present a preliminary study in which we use linguistic information as a soft biometric trait to enhance the performance of a visual (auditory-free) identification system based on lip movement. We report a significant improvement in the identification performance of the proposed visual system as a result of the integration of these data.
arXiv Detail & Related papers (2023-02-27T15:44:24Z)
Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs) We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date. We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, author, and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments. We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching [41.88097793717185]
Code-Switching (CS) is a common linguistic phenomenon in multilingual communities. This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech.
arXiv Detail & Related papers (2021-12-19T17:31:15Z)
Automated Speech Scoring System Under The Lens: Evaluating and interpreting the linguistic cues for language proficiency [26.70127591966917]
We utilize classical machine learning models to formulate a speech scoring task as both a classification and a regression problem. First, we extract linguist features under five categories (fluency, pronunciation, content, grammar and vocabulary, and acoustic) and train models to grade responses. In comparison, we find that the regression-based models perform equivalent to or better than the classification approach.
arXiv Detail & Related papers (2021-11-30T06:28:58Z)
Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks. Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension. This paper further utilized unlabeled data to improve the performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.