LRWR: Large-Scale Benchmark for Lip Reading in Russian language
- URL: http://arxiv.org/abs/2109.06692v1
- Date: Tue, 14 Sep 2021 13:51:19 GMT
- Title: LRWR: Large-Scale Benchmark for Lip Reading in Russian language
- Authors: Evgeniy Egorov, Vasily Kostyumov, Mikhail Konyk, Sergey Kolesnikov
- Abstract summary: Lipreading aims to identify the speech content from videos by analyzing the visual deformations of lips and nearby areas.
One of the significant obstacles for research in this field is the lack of proper datasets for a wide variety of languages.
We introduce a naturally distributed benchmark for lipreading in the Russian language, named LRWR, which contains 235 classes and 135 speakers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lipreading, also known as visual speech recognition, aims to identify the
speech content from videos by analyzing the visual deformations of lips and
nearby areas. One of the significant obstacles for research in this field is
the lack of proper datasets covering a wide variety of languages: so far,
lipreading methods have focused only on English and Chinese. In this paper, we
introduce a naturally distributed large-scale benchmark for lipreading in the
Russian language, named LRWR, which contains 235 classes and 135 speakers. We
provide a detailed description of the dataset collection pipeline and dataset
statistics. We also present a comprehensive comparison of the current popular
lipreading methods on LRWR and conduct a detailed analysis of their
performance. The results demonstrate the differences between the benchmarked
languages and suggest several promising directions for fine-tuning lipreading
models. Building on these findings, we also achieve new state-of-the-art results
on the English LRW benchmark.
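LRWR follows the LRW-style word-classification setup: each clip is labeled with one of 235 target word classes, and models are typically compared by top-1 accuracy. As a rough illustration of the kind of architecture commonly benchmarked in this setting (not the authors' exact models), the minimal PyTorch sketch below combines a 3D-convolutional front end over grayscale mouth crops, a per-frame ResNet-18 trunk, and a BiGRU temporal backend; all module choices and hyperparameters here are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's model): a typical LRW-style
# word-level lipreading classifier.
import torch
import torch.nn as nn
import torchvision.models as models

class WordLipReader(nn.Module):
    def __init__(self, num_classes: int = 235):  # 235 word classes, as in LRWR
        super().__init__()
        # Spatio-temporal front end over grayscale mouth crops (B, 1, T, H, W)
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame 2D trunk: ResNet-18 without its stem and classifier
        resnet = models.resnet18(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[4:-1])  # layer1..avgpool
        # Temporal backend over per-frame embeddings
        self.gru = nn.GRU(512, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 1, T, H, W)
        b = x.size(0)
        f = self.frontend(x)                              # (B, 64, T, H', W')
        t = f.size(2)
        f = f.transpose(1, 2).reshape(b * t, 64, f.size(3), f.size(4))
        f = self.trunk(f).reshape(b, t, -1)               # (B, T, 512) frame embeddings
        out, _ = self.gru(f)                              # (B, T, 512)
        return self.head(out.mean(dim=1))                 # word logits

if __name__ == "__main__":
    logits = WordLipReader()(torch.randn(2, 1, 29, 88, 88))  # 29 frames of 88x88 crops
    print(logits.shape)  # torch.Size([2, 235])
```

The 3D front end captures short-range lip motion, the 2D trunk encodes each frame, and the recurrent backend aggregates the short clip into a single word prediction; many LRW-style systems instead use a temporal-convolutional or Transformer backend in place of the GRU.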
Related papers
- A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification [1.566834021297545]
This study systematically evaluates translation bias and the effectiveness of Large Language Models for cross-lingual claim verification.
We investigate two distinct translation methods: pre-translation and self-translation.
Our findings reveal that low-resource languages exhibit significantly lower accuracy in direct inference due to underrepresentation.
arXiv Detail & Related papers (2024-10-14T09:02:42Z) - Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages.
Our experiments reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents.
Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z) - End-to-End Lip Reading in Romanian with Cross-Lingual Domain Adaptation
and Lateral Inhibition [2.839471733237535]
We analyze several architectures and optimizations on the underrepresented, short-scale Romanian language dataset called Wild LRRo.
We obtain state-of-the-art results using our proposed method, namely cross-lingual domain adaptation and unlabeled videos.
We also assess the performance of adding a layer inspired by the neural inhibition mechanism.
arXiv Detail & Related papers (2023-10-07T15:36:58Z) - Leveraging Visemes for Better Visual Speech Representation and Lip
Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method (WER is illustrated in the sketch after this list).
arXiv Detail & Related papers (2023-07-19T17:38:26Z) - FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z) - A Multi-level Supervised Contrastive Learning Framework for Low-Resource
Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is an increasingly essential task in natural language understanding.
Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - A Multimodal German Dataset for Automatic Lip Reading Systems and
Transfer Learning [18.862801476204886]
We present the dataset GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament.
The format is similar to that of the English language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest in a context of 1.16 seconds duration.
By training a deep neural network, we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to improve lip reading models.
arXiv Detail & Related papers (2022-02-27T17:37:35Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)