A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech
Recognition: the Arman-AV Dataset
- URL: http://arxiv.org/abs/2301.10180v1
- Date: Sat, 21 Jan 2023 05:13:30 GMT
- Title: A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech
Recognition: the Arman-AV Dataset
- Authors: Javad Peymanfard, Samin Heydarian, Ali Lashini, Hossein Zeinali,
Mohammad Reza Mohammadi, Nasser Mozayani
- Abstract summary: This paper presents a new multi-purpose audio-visual dataset for Persian.
It consists of almost 220 hours of video from 1,760 speakers.
The dataset is suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition.
- Score: 2.594602184695942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, significant progress has been made in automatic lip
reading. But these methods require large-scale datasets that do not exist for
many low-resource languages. In this paper, we present a new multi-purpose
audio-visual dataset for Persian. The dataset consists of almost 220 hours of
video from 1,760 speakers. In addition to lip reading, it is suitable for
automatic speech recognition, audio-visual speech recognition, and speaker
recognition, and it is the first large-scale lip reading dataset in Persian. We
provide a baseline method for each of these tasks. We also propose a technique
for detecting visemes (the visual equivalents of phonemes) in Persian, which
can be applied to other languages as well. The visemes obtained by this method
improve the accuracy of the lip reading task by 7% relative to the previously
proposed visemes.
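The viseme-detection procedure itself is not reproduced here; as a rough
illustration of the underlying idea, the sketch below collapses a phoneme
sequence into viseme classes through a many-to-one mapping. The groups and
class names are hypothetical, not the mapping derived in the paper.

```python
# Hypothetical phoneme-to-viseme mapping: phonemes that produce similar lip
# shapes collapse into one viseme class. These groups are illustrative only;
# the paper derives its own data-driven mapping for Persian.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "a": "V_open", "e": "V_mid", "i": "V_spread", "u": "V_round",
}

def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence to visemes, merging adjacent duplicates, since
    consecutive identical lip shapes are visually indistinguishable."""
    visemes = []
    for ph in phonemes:
        v = PHONEME_TO_VISEME.get(ph, "V_other")
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes

print(phonemes_to_visemes(["b", "a", "m", "i"]))
# ['V_bilabial', 'V_open', 'V_bilabial', 'V_spread']
```

Note that the 7% figure is a relative gain: it scales the baseline, so an
accuracy of, say, 60.0% would rise to about 64.2%, not to 67.0%.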
Related papers
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
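The exact architecture is not described in this summary; as a rough sketch (an
assumption for illustration, not the authors' model), SpeakerID can be framed
as scoring each candidate name against the dialogue context with a pretrained
cross-encoder:

```python
# Minimal cross-encoder sketch: score (context, candidate name) pairs and pick
# the highest-scoring candidate. The checkpoint and single-logit head are
# placeholders; the head is untrained until fine-tuned on SpeakerID data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)

def best_speaker(context, candidates):
    batch = tokenizer([context] * len(candidates), candidates,
                      truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**batch).logits.squeeze(-1)  # one score per candidate
    return candidates[int(scores.argmax())]
```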
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading [3.502468086816445]
Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area.
Recent deep-learning-based works aim to integrate visual features extracted from the mouth region with landmark points on the lip contours.
We propose a cross-attention fusion-based approach for large-vocabulary Arabic lipreading to predict spoken words in videos.
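A minimal sketch of cross-attention fusion between the two streams, assuming
time-aligned visual (mouth-region) and geometric (landmark) feature sequences;
the dimensions and layout are illustrative, not the authors' exact model:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Visual frame features attend to lip-landmark features (assumed design)."""
    def __init__(self, dim_visual=512, dim_geom=64, dim=256, heads=4):
        super().__init__()
        self.proj_v = nn.Linear(dim_visual, dim)  # project mouth-ROI features
        self.proj_g = nn.Linear(dim_geom, dim)    # project landmark features
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, geom):
        # visual: (B, T, dim_visual); geom: (B, T, dim_geom)
        q = self.proj_v(visual)
        kv = self.proj_g(geom)
        fused, _ = self.attn(q, kv, kv)  # queries come from the visual stream
        return self.norm(q + fused)      # residual connection + layer norm

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 75, 512), torch.randn(2, 75, 64))
print(out.shape)  # torch.Size([2, 75, 256])
```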
arXiv Detail & Related papers (2024-02-18T09:22:58Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a data augmentation framework based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
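A minimal sketch of the augmentation idea, assuming the synthetic ("deepfake")
utterances are already paired with the transcripts they were generated from;
the sampling ratio is a hypothetical knob:

```python
import random

def augment_with_synthetic(real_pairs, synthetic_pairs, ratio=0.5, seed=0):
    """real_pairs / synthetic_pairs: lists of (audio_path, transcript).
    Adds roughly ratio * len(real_pairs) synthetic examples and shuffles."""
    rng = random.Random(seed)
    n_extra = min(int(len(real_pairs) * ratio), len(synthetic_pairs))
    combined = real_pairs + rng.sample(synthetic_pairs, n_extra)
    rng.shuffle(combined)
    return combined
```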
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
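For clarity, a "relative" WER reduction scales the baseline rather than
subtracting percentage points; a toy computation with made-up numbers:

```python
baseline_wer = 0.300        # hypothetical baseline WER
relative_reduction = 0.091  # 9.1% relative, as reported
new_wer = baseline_wer * (1 - relative_reduction)
print(f"{new_wer:.3f}")     # 0.273, i.e. not 0.300 - 0.091 = 0.209
```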
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve Bengali speech recognition by applying transfer learning to an end-to-end (E2E) architecture.
The proposed method effectively models the Bengali language and achieves a Levenshtein Mean Distance of 3.819 on a test set of 7,747 samples when trained on only 1,000 training samples.
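The metric is a mean Levenshtein (edit) distance over the test set; for
reference, a standard dynamic-programming implementation (not the paper's
code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

refs, hyps = ["salam"], ["salaam"]  # toy reference/hypothesis pair
mean_dist = sum(levenshtein(r, h) for r, h in zip(refs, hyps)) / len(refs)
print(mean_dist)  # 1.0
```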
arXiv Detail & Related papers (2022-09-16T18:20:16Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1,909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
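A schematic of how the three components could compose, with toy stand-ins
showing only the data flow; the interfaces are assumptions, not the authors'
code:

```python
def recognize(audio, phone_model, to_words, lm_score):
    """phone_model: audio -> phone sequence (language-independent recognizer);
    to_words: phones -> candidate word sequences (pronunciation model);
    lm_score: word sequence -> log-probability (target-language n-gram LM)."""
    phones = phone_model(audio)
    return max(to_words(phones), key=lm_score)

best = recognize(
    audio=None,  # no target-language audio is needed
    phone_model=lambda a: ["s", "a", "l", "a", "m"],
    to_words=lambda ph: ["salam", "salaam"],
    lm_score=lambda w: {"salam": -1.0, "salaam": -3.0}[w],
)
print(best)  # salam
```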
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is just as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
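A minimal sketch of the auxiliary-task idea: a shared video encoder feeds the
main recognition head plus a prediction head, and their losses are combined
with a weight. The encoder, heads, and targets below are placeholders, not the
paper's model:

```python
import torch
import torch.nn as nn

class VSRWithAuxiliary(nn.Module):
    def __init__(self, feat_dim=256, vocab=1000, aux_dim=80, aux_weight=0.1):
        super().__init__()
        self.encoder = nn.GRU(96, feat_dim, batch_first=True)  # stand-in encoder
        self.main_head = nn.Linear(feat_dim, vocab)   # e.g., character logits
        self.aux_head = nn.Linear(feat_dim, aux_dim)  # e.g., predicts audio features
        self.aux_weight = aux_weight

    def forward(self, video_feats, main_targets, aux_targets):
        h, _ = self.encoder(video_feats)
        main_loss = nn.functional.cross_entropy(
            self.main_head(h).transpose(1, 2), main_targets)
        aux_loss = nn.functional.mse_loss(self.aux_head(h), aux_targets)
        return main_loss + self.aux_weight * aux_loss

model = VSRWithAuxiliary()
loss = model(torch.randn(2, 75, 96),           # video features
             torch.randint(0, 1000, (2, 75)),  # main-task targets
             torch.randn(2, 75, 80))           # auxiliary targets
```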
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- Data Augmentation for Speech Recognition in Maltese: A Low-Resource Perspective [4.6898263272139795]
We consider data augmentation techniques for improving speech recognition in Maltese.
We consider three types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data.
Our results show that combining the three data augmentation techniques studied here leads to an absolute WER improvement of 15 percentage points without the use of a language model.
arXiv Detail & Related papers (2021-11-15T14:28:21Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- Multitask Training with Text Data for End-to-End Speech Recognition [45.35605825009208]
We propose a multitask training method for attention-based end-to-end speech recognition models.
We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data.
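A toy sketch of the idea: the same decoder computes a sequence loss with
encoder context on paired batches and, as a regularizer, a language-model loss
without context on text-only batches. The tiny decoder below is a stand-in,
not the paper's listen, attend, and spell model:

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Predicts the next token from previous tokens, optionally conditioned on
    an encoder context vector (stand-in for attention over audio)."""
    def __init__(self, vocab=100, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, context=None):
        x = self.embed(tokens)
        if context is not None:          # paired mode: inject audio context
            x = x + context.unsqueeze(1)
        h, _ = self.rnn(x)
        return self.out(h)

def multitask_loss(dec, ctx, paired_tokens, text_only_tokens, lam=0.3):
    ce = nn.functional.cross_entropy
    asr = ce(dec(paired_tokens[:, :-1], ctx).transpose(1, 2), paired_tokens[:, 1:])
    lm = ce(dec(text_only_tokens[:, :-1]).transpose(1, 2), text_only_tokens[:, 1:])
    return asr + lam * lm  # text-only term regularizes the decoder

dec = TinyDecoder()
loss = multitask_loss(dec, torch.randn(2, 64),
                      torch.randint(0, 100, (2, 12)),
                      torch.randint(0, 100, (2, 12)))
```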
arXiv Detail & Related papers (2020-10-27T14:29:28Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus provides aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.