LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild
- URL: http://arxiv.org/abs/2311.12457v1
- Date: Tue, 21 Nov 2023 09:12:21 GMT
- Title: LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild
- Authors: David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
- Abstract summary: This paper presents a semi-automatically annotated audiovisual database to deal with unconstrained natural Spanish.
Results for both speaker-dependent and speaker-independent scenarios are reported using Hidden Markov Models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech is considered a multi-modal process in which hearing and
vision are two fundamental pillars. In fact, several studies have demonstrated
that the robustness of Automatic Speech Recognition systems can be improved
when audio and visual cues are combined to represent the nature of speech. In
addition, Visual Speech Recognition, an open research problem whose purpose is
to interpret speech by reading the lips of the speaker, has been a focus of
interest in recent decades. Nevertheless, in order to estimate these systems
in the current Deep Learning era, large-scale databases are required. However,
while most of these databases are dedicated to English, other languages lack
sufficient resources. Thus, this paper presents a
semi-automatically annotated audiovisual database to deal with unconstrained
natural Spanish, providing 13 hours of data extracted from Spanish television.
Furthermore, baseline results for both speaker-dependent and
speaker-independent scenarios are reported using Hidden Markov Models, a
traditional paradigm that has been widely used in the field of Speech
Technologies.
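The reported baselines rest on Hidden Markov Models. As a rough illustration of that paradigm (not the authors' actual setup, whose toolkit and features are not specified in this summary), the sketch below trains one Gaussian HMM per word with the hmmlearn library and classifies a test sequence by log-likelihood; the feature frames, vocabulary, and hyperparameters are all hypothetical.

```python
# Toy sketch: an isolated-word HMM "baseline" in the spirit of classic
# GMM/HMM recognisers. Features, vocabulary, and hyperparameters are
# illustrative assumptions, not the paper's actual configuration.
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

rng = np.random.default_rng(0)

def fake_features(n_frames, n_dims=39):
    """Stand-in for MFCC-like acoustic (or visual) feature frames."""
    return rng.normal(size=(n_frames, n_dims))

# One HMM per word; decoding picks the word whose model scores highest.
vocab = ["hola", "adios"]
models = {}
for word in vocab:
    # A handful of training sequences per word (hypothetical data).
    seqs = [fake_features(rng.integers(20, 40)) for _ in range(5)]
    X = np.concatenate(seqs)
    lengths = [len(s) for s in seqs]
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(X, lengths)  # Baum-Welch parameter estimation
    models[word] = m

test = fake_features(30)
# Forward-algorithm log-likelihood under each word model.
print(max(vocab, key=lambda w: models[w].score(test)))
```

Continuous recognition, as addressed by the paper, would additionally require a lexicon and a language model on top of such per-unit models; the toy above only covers isolated-unit scoring.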
Related papers
- PRODIS - a speech database and a phoneme-based language model for the study of predictability effects in Polish [1.2016264781280588]
We present a speech database and a phoneme-level language model of Polish.
The database is the first large, publicly available Polish speech corpus of excellent acoustic quality.
arXiv Detail & Related papers (2024-04-15T20:03:58Z)
- Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish [0.0]
This paper studies how the estimation of specialized end-to-end systems for a specific person could affect speech recognition quality.
Results comparable to the current state of the art were reached even when only a limited amount of data was available.
arXiv Detail & Related papers (2023-11-21T09:44:33Z)
- Analysis of Visual Features for Continuous Lipreading in Spanish [0.0]
Lipreading is a complex task whose objective is to interpret speech when audio is not available.
We propose an analysis of different visual speech features to identify which of them best captures the nature of lip movements for natural Spanish.
arXiv Detail & Related papers (2023-11-21T09:28:00Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain redundancies.
Recent investigations have proposed the use of discrete speech units derived from self-supervised learning representations.
Applying methods such as de-duplication and subword modeling can further compress the speech sequence length (a toy de-duplication sketch follows this entry).
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
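As a toy illustration of the de-duplication step mentioned in the entry above: collapsing runs of repeated discrete units already shortens the sequence before any subword modeling (e.g., BPE) is applied. The unit IDs below are made up; in practice they would come from, say, k-means clustering of self-supervised features.

```python
# Toy sketch of de-duplication for discrete speech units: collapse
# consecutive runs of the same unit ID into a single token. The unit
# sequence here is a made-up example.
from itertools import groupby

units = [52, 52, 52, 7, 7, 31, 31, 31, 31, 7]
deduped = [unit_id for unit_id, _ in groupby(units)]
print(deduped)                          # [52, 7, 31, 7]
print(len(units), "->", len(deduped))   # 10 -> 4
```

Subword modeling would then merge frequent unit n-grams into single tokens, compressing the sequence further.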
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset [14.619865864254924]
The Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset is the largest among publicly available audio-visual speech datasets.
The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations.
arXiv Detail & Related papers (2023-01-16T11:40:50Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework in which we randomly mask the spectrogram and the phonemes (a minimal masking sketch follows this entry).
Our model shows substantial improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
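The masking objective described in the entry above can be pictured with a small sketch. Everything here is an assumption for illustration (tensor shapes, the 15% mask ratio, zeroing masked frames, a MASK_ID of 0); ERNIE-SAT's actual masking strategy is not detailed in this summary.

```python
# Toy sketch of jointly masking spectrogram frames and phoneme tokens.
# Shapes, mask ratio, and MASK_ID are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
spec = rng.normal(size=(120, 80))      # (frames, mel bins), hypothetical
phones = rng.integers(1, 50, size=30)  # phoneme IDs, hypothetical
MASK_ID = 0

def mask_frames(x, ratio=0.15):
    """Zero out a random subset of spectrogram frames."""
    mask = rng.random(x.shape[0]) < ratio
    x = x.copy()
    x[mask] = 0.0
    return x, mask

masked_spec, frame_mask = mask_frames(spec)
token_mask = rng.random(phones.shape) < 0.15
masked_phones = np.where(token_mask, MASK_ID, phones)
print(frame_mask.sum(), "frames and", token_mask.sum(), "phonemes masked")
```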
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)