LRWR: Large-Scale Benchmark for Lip Reading in Russian language
- URL: http://arxiv.org/abs/2109.06692v1
- Date: Tue, 14 Sep 2021 13:51:19 GMT
- Title: LRWR: Large-Scale Benchmark for Lip Reading in Russian language
- Authors: Evgeniy Egorov, Vasily Kostyumov, Mikhail Konyk, Sergey Kolesnikov
- Abstract summary: Lipreading aims to identify the speech content from videos by analyzing the visual deformations of lips and nearby areas.
One of the significant obstacles for research in this field is the lack of proper datasets for a wide variety of languages.
We introduce a naturally distributed benchmark for lipreading in the Russian language, named LRWR, which contains 235 classes and 135 speakers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lipreading, also known as visual speech recognition, aims to identify the
speech content from videos by analyzing the visual deformations of lips and
nearby areas. One of the significant obstacles for research in this field is
the lack of proper datasets for a wide variety of languages: so far, these
methods have been focused only on English or Chinese. In this paper, we
introduce a naturally distributed large-scale benchmark for lipreading in the
Russian language, named LRWR, which contains 235 classes and 135 speakers. We
provide a detailed description of the dataset collection pipeline and dataset
statistics. We also present a comprehensive comparison of the current popular
lipreading methods on LRWR and conduct a detailed analysis of their
performance. The results demonstrate the differences between the benchmarked
languages and provide several promising directions for fine-tuning lipreading
models. Thanks to our findings, we also achieved new state-of-the-art
results on the LRW benchmark.
Related papers
- A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification [1.566834021297545]
This study systematically evaluates translation bias and the effectiveness of Large Language Models for cross-lingual claim verification.
We investigate two distinct translation methods: pre-translation and self-translation.
Our findings reveal that low-resource languages exhibit significantly lower accuracy in direct inference due to underrepresentation.
arXiv Detail & Related papers (2024-10-14T09:02:42Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- End-to-End Lip Reading in Romanian with Cross-Lingual Domain Adaptation and Lateral Inhibition [2.839471733237535]
We analyze several architectures and optimizations on the underrepresented, short-scale Romanian language dataset called Wild LRRo.
We obtain state-of-the-art results using our proposed method, namely cross-lingual domain adaptation and unlabeled videos.
We also assess the performance of adding a layer inspired by the neural inhibition mechanism.
arXiv Detail & Related papers (2023-10-07T15:36:58Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- A Multimodal German Dataset for Automatic Lip Reading Systems and Transfer Learning [18.862801476204886]
We present the dataset GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament.
The format is similar to that of the English-language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest within a context of 1.16 seconds.
By training a deep neural network, we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to improve lip reading models.
arXiv Detail & Related papers (2022-02-27T17:37:35Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)
- Cross-language Sentence Selection via Data Augmentation and Rationale Training [22.106577427237635]
The proposed approach uses data augmentation and negative sampling on noisy parallel sentence data to learn a cross-lingual embedding-based query relevance model.
Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data.
arXiv Detail & Related papers (2021-06-04T07:08:47Z)
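Several of the results above are reported as word error rate (WER) or as a relative WER reduction. As a reference point only (this sketch is not taken from any of the listed papers), WER is conventionally computed as the word-level Levenshtein distance between hypothesis and reference, divided by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Relative improvement of a metric, as used in 'X% relative' claims."""
    return (old_wer - new_wer) / old_wer
```

For example, `wer("the cat sat", "the cat sat on")` counts one insertion against a three-word reference, giving 1/3.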
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.