A Multimodal German Dataset for Automatic Lip Reading Systems and
Transfer Learning
- URL: http://arxiv.org/abs/2202.13403v1
- Date: Sun, 27 Feb 2022 17:37:35 GMT
- Title: A Multimodal German Dataset for Automatic Lip Reading Systems and
Transfer Learning
- Authors: Gerald Schwiebert, Cornelius Weber, Leyuan Qu, Henrique Siqueira,
Stefan Wermter
- Abstract summary: We present the dataset GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament.
The format is similar to that of the English language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest in a context of 1.16 seconds duration.
By training a deep neural network, we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to improve lip reading models.
- Score: 18.862801476204886
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large datasets as required for deep learning of lip reading do not exist in
many languages. In this paper we present the dataset GLips (German Lips)
consisting of 250,000 publicly available videos of the faces of speakers of the
Hessian Parliament, which was processed for word-level lip reading using an
automatic pipeline. The format is similar to that of the English language LRW
(Lip Reading in the Wild) dataset, with each video encoding one word of
interest in a context of 1.16 seconds duration, which yields compatibility for
studying transfer learning between both datasets. By training a deep neural
network, we investigate whether lip reading has language-independent features,
so that datasets of different languages can be used to improve lip reading
models. We demonstrate learning from scratch and show that transfer learning
from LRW to GLips and vice versa improves learning speed and performance, in
particular for the validation set.
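To make the transfer setup concrete, here is a minimal PyTorch sketch of the LRW-to-GLips pattern: pretrain a word-level classifier, keep its visual front-end, and swap the classification head for the new vocabulary. The architecture and class counts below are illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

class WordLipReader(nn.Module):
    """Word-level lip reading classifier over fixed-length mouth-region clips."""
    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        # Placeholder visual front-end; LRW-style models typically combine a
        # 3D-conv stem with a 2D ResNet and a temporal backend.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 1, frames, height, width); 1.16 s at 25 fps is 29 frames
        return self.classifier(self.frontend(clips))

# Transfer LRW -> GLips: keep the pretrained visual front-end, swap the head.
model = WordLipReader(num_classes=500)          # LRW has 500 word classes
# model.load_state_dict(torch.load("lrw_pretrained.pt"))  # hypothetical checkpoint
num_glips_classes = 500                          # set to the GLips vocabulary size
model.classifier = nn.Linear(512, num_glips_classes)
```

The same pattern runs in the other direction (GLips to LRW) by swapping which dataset provides the pretrained weights.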
Related papers
- Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation.
We show that VSP-LLM trained on just 30 hours of labeled data can translate lip movements more effectively than models trained with much more labeled data.
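A common mechanism behind such LLM-based visual speech models is to project visual features into the language model's token-embedding space so the LLM can decode text from them; the module and dimensions below are generic assumptions, not VSP-LLM's actual components.

```python
import torch
import torch.nn as nn

class VisualToLLMProjector(nn.Module):
    """Maps per-frame visual speech features into an LLM's embedding space."""
    def __init__(self, visual_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, frames, visual_dim)
        return self.proj(visual_feats)

projector = VisualToLLMProjector()
feats = torch.randn(2, 75, 512)       # e.g. 75 frames of visual speech features
soft_prompt = projector(feats)        # prepend to the LLM's token embeddings
```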
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
- Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning [58.92843729869586]
Vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, but their mastery of only a few languages, such as English, restricts their applicability in broader communities.
We propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF).
We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance.
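Multilingual image-text retrieval of this kind is typically scored with recall@K; a minimal sketch of that metric follows (the benchmark's exact protocol may differ).

```python
import torch

def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 1) -> float:
    """image_emb, text_emb: (N, D), L2-normalized; pair i is the ground truth."""
    sims = image_emb @ text_emb.T                      # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices                 # best-k text indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # correct index for each row
    return (topk == targets).any(dim=1).float().mean().item()
```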
arXiv Detail & Related papers (2024-01-30T17:14:05Z)
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework tailored to low-resource languages.
Because low-resource languages lack the video-text paired data needed for training, developing lip reading models for them is regarded as challenging.
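One hedged way to picture the combination of general and language-specific knowledge (not necessarily the paper's method) is a frozen encoder pretrained on high-resource data plus a small head trained on the low-resource language:

```python
import torch
import torch.nn as nn

class LowResourceLipReader(nn.Module):
    """Frozen general encoder plus a small trainable language-specific head."""
    def __init__(self, general_encoder: nn.Module, feat_dim: int, num_units: int):
        super().__init__()
        self.encoder = general_encoder   # pretrained on high-resource data
        self.head = nn.Sequential(       # carries the language-specific knowledge
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, num_units)
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():            # keep the general knowledge fixed
            feats = self.encoder(video)
        return self.head(feats)
```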
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV Dataset [2.594602184695942]
This paper presents a new multipurpose audio-visual dataset for Persian.
It consists of almost 220 hours of video from 1760 speakers.
The dataset is suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition.
arXiv Detail & Related papers (2023-01-21T05:13:30Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
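For reference, word error rate, the metric behind the 22.6% figure, is the word-level edit distance divided by the reference length; a standard implementation, not the authors' code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

assert abs(word_error_rate("the cat sat", "the cat sit") - 1 / 3) < 1e-9
```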
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LRWR: Large-Scale Benchmark for Lip Reading in Russian language [0.0]
Lipreading aims to identify the speech content from videos by analyzing the visual deformations of lips and nearby areas.
One of the significant obstacles for research in this field is the lack of proper datasets for a wide variety of languages.
We introduce a naturally distributed benchmark for lipreading in the Russian language, named LRWR, which contains 235 classes and 135 speakers.
arXiv Detail & Related papers (2021-09-14T13:51:19Z)
- Lip reading using external viseme decoding [4.728757318184405]
This paper shows how to use external text data for viseme-to-character mapping by dividing the video-to-character task into two stages.
Our proposed method improves the word error rate by 4% compared to a standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 dataset.
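A toy sketch of the two-stage idea (not the paper's model): stage one predicts visemes from video, and stage two maps visemes to characters, using a language model trained on external text to resolve the many-to-one ambiguity. The viseme inventory and LM here are hypothetical.

```python
# Hypothetical viseme classes; real inventories differ. Several phonemes
# share one viseme (e.g. /b/, /m/, /p/ look alike on the lips), so the
# second stage must disambiguate using knowledge from external text.
VISEME_TO_CHARS = {"BMP": ["b", "m", "p"], "FV": ["f", "v"], "AH": ["a"]}

def decode_visemes(visemes, lm_score):
    """Greedily pick, per viseme, the candidate a text LM scores highest."""
    out = ""
    for v in visemes:
        out += max(VISEME_TO_CHARS[v], key=lambda c: lm_score(out + c))
    return out

# Usage with a trivial stand-in LM that prefers the word "mama":
print(decode_visemes(["BMP", "AH", "BMP", "AH"],
                     lambda s: float("mama".startswith(s))))  # -> "mama"
```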
arXiv Detail & Related papers (2021-04-10T14:49:11Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
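Self-supervised sentence-video mapping is commonly trained with a contrastive objective that treats co-occurring pairs as positives and other in-batch pairs as negatives; the sketch below shows that generic recipe, not necessarily this paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, sent_emb, temperature=0.07):
    """video_emb, sent_emb: (N, D); row i is a co-occurring video/sentence pair."""
    v = F.normalize(video_emb, dim=1)
    s = F.normalize(sent_emb, dim=1)
    logits = v @ s.T / temperature               # (N, N) similarity scores
    targets = torch.arange(logits.size(0))
    # symmetric cross-entropy: match each video to its sentence and vice versa
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```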
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Synchronous Bidirectional Learning for Multilingual Lip Reading [99.14744013265594]
Lip movements in all languages share similar patterns due to the common structure of human speech organs.
Phonemes are more closely related to lip movements than alphabet letters are.
A novel SBL block is proposed to learn the rules for each language in a fill-in-the-blank way.
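A "fill-in-the-blank" learning rule is essentially masked prediction over symbol sequences; the generic sketch below illustrates it, with dimensions and masking rate as assumptions rather than the SBL block's actual design.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 40, 256                    # e.g. a phoneme-level vocabulary
embed = nn.Embedding(vocab_size + 1, d_model)    # last index acts as [MASK]
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 20))   # batch of symbol sequences
masked = tokens.clone()
mask = torch.rand_like(tokens, dtype=torch.float) < 0.15
masked[mask] = vocab_size                        # blank out ~15% of positions
logits = head(encoder(embed(masked)))            # predict the blanked symbols
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```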
arXiv Detail & Related papers (2020-05-08T04:19:57Z)