LRW-Persian: Lip-reading in the Wild Dataset for Persian Language
- URL: http://arxiv.org/abs/2510.22716v1
- Date: Sun, 26 Oct 2025 15:21:42 GMT
- Title: LRW-Persian: Lip-reading in the Wild Dataset for Persian Language
- Authors: Zahra Taghizadeh, Mohammad Shahverdikondori, Arian Noori, Alireza Dadgarnia,
- Abstract summary: LRW-Persian is the largest in-the-wild Persian word-level lipreading dataset. It provides speaker-disjoint training and test splits, wide regional and dialectal coverage, and rich per-clip metadata.
- Score: 1.1666234644810893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lipreading has emerged as an increasingly important research area for developing robust speech recognition systems and assistive technologies for the hearing-impaired. However, non-English resources for visual speech recognition remain limited. We introduce LRW-Persian, the largest in-the-wild Persian word-level lipreading dataset, comprising 743 target words and over 414,000 video samples extracted from more than 1,900 hours of footage across 67 television programs. Designed as a benchmark-ready resource, LRW-Persian provides speaker-disjoint training and test splits, wide regional and dialectal coverage, and rich per-clip metadata including head pose, age, and gender. To ensure large-scale data quality, we establish a fully automated end-to-end curation pipeline encompassing transcription based on Automatic Speech Recognition (ASR), active-speaker localization, quality filtering, and pose/mask screening. We further fine-tune two widely used lipreading architectures on LRW-Persian, establishing reference performance and demonstrating the difficulty of Persian visual speech recognition. By filling a critical gap in low-resource languages, LRW-Persian enables rigorous benchmarking, supports cross-lingual transfer, and provides a foundation for advancing multimodal speech research in underrepresented linguistic contexts. The dataset is publicly available at: https://lrw-persian.vercel.app.
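The abstract mentions speaker-disjoint training and test splits. As a minimal sketch of what that property means (not the authors' pipeline), the following Python groups clips by speaker and assigns each speaker wholly to one split. The function name, the `(clip_id, speaker_id)` input format, and the `test_fraction` and `seed` parameters are assumptions made for illustration and are not taken from the LRW-Persian release.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(clips, test_fraction=0.2, seed=0):
    """Split clips so that no speaker appears in both train and test.

    `clips` is a list of (clip_id, speaker_id) pairs. This input format is
    hypothetical; the real dataset's metadata layout may differ.
    """
    # Group clip IDs by the speaker who appears in them.
    by_speaker = defaultdict(list)
    for clip_id, speaker_id in clips:
        by_speaker[speaker_id].append(clip_id)

    # Shuffle speakers (not clips) deterministically, then hold out a fraction.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    # Every clip follows its speaker into exactly one split.
    train = [c for s in speakers if s not in test_speakers for c in by_speaker[s]]
    test = [c for s in speakers if s in test_speakers for c in by_speaker[s]]
    return train, test, test_speakers


# Toy usage: 20 clips spread over 5 made-up speakers.
clips = [(f"clip{i}", f"spk{i % 5}") for i in range(20)]
train, test, test_speakers = speaker_disjoint_split(clips)
```

Splitting at the speaker level rather than the clip level means test accuracy reflects generalization to unseen faces, at the cost of split sizes that only approximate the requested fraction.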
Related papers
- PRiSM: Benchmarking Phone Realization in Speech Models [70.82595415252682]
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception.
arXiv Detail & Related papers (2026-01-20T15:00:36Z)
- Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations [65.59784436914548]
We introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. We convert the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC).
arXiv Detail & Related papers (2025-03-08T16:40:13Z)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading [3.502468086816445]
Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area.
Recent deep-learning based works aim to integrate visual features extracted from the mouth region with landmark points on the lip contours.
We propose a cross-attention fusion-based approach for large lexicon Arabic vocabulary to predict spoken words in videos.
arXiv Detail & Related papers (2024-02-18T09:22:58Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV Dataset [2.594602184695942]
This paper presents a new multipurpose audio-visual dataset for Persian.
It consists of almost 220 hours of videos with 1760 corresponding speakers.
The dataset is suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition.
arXiv Detail & Related papers (2023-01-21T05:13:30Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z)
- LRWR: Large-Scale Benchmark for Lip Reading in Russian language [0.0]
Lipreading aims to identify the speech content from videos by analyzing the visual deformations of lips and nearby areas.
One of the significant obstacles for research in this field is the lack of proper datasets for a wide variety of languages.
We introduce a naturally distributed benchmark for lipreading in Russian language, named LRWR, which contains 235 classes and 135 speakers.
arXiv Detail & Related papers (2021-09-14T13:51:19Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.