Cross-Attention Fusion of Visual and Geometric Features for Large
Vocabulary Arabic Lipreading
- URL: http://arxiv.org/abs/2402.11520v1
- Date: Sun, 18 Feb 2024 09:22:58 GMT
- Title: Cross-Attention Fusion of Visual and Geometric Features for Large
Vocabulary Arabic Lipreading
- Authors: Samar Daou, Ahmed Rekik, Achraf Ben-Hamadou, Abdelaziz Kallel
- Abstract summary: Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area.
Recent deep-learning based works aim to integrate visual features extracted from the mouth region with landmark points on the lip contours.
We propose a cross-attention fusion-based approach for large lexicon Arabic vocabulary to predict spoken words in videos.
- Score: 3.502468086816445
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Lipreading involves using visual data to recognize spoken words by analyzing
the movements of the lips and surrounding area. It is a hot research topic with
many potential applications, such as human-machine interaction and enhancing
audio speech recognition. Recent deep-learning based works aim to integrate
visual features extracted from the mouth region with landmark points on the lip
contours. However, employing a simple combination method such as concatenation
may not be the most effective approach to get the optimal feature vector. To
address this challenge, firstly, we propose a cross-attention fusion-based
approach for large lexicon Arabic vocabulary to predict spoken words in videos.
Our method leverages the power of cross-attention networks to efficiently
integrate visual and geometric features computed on the mouth region. Secondly,
we introduce the first large-scale Lip Reading in the Wild for Arabic (LRW-AR)
dataset containing 20,000 videos for 100-word classes, uttered by 36 speakers.
The experimental results obtained on LRW-AR and ArabicVisual databases showed
the effectiveness and robustness of the proposed approach in recognizing Arabic
words. Our work provides insights into the feasibility and effectiveness of
applying lipreading techniques to the Arabic language, opening doors for
further research in this field. Link to the project page:
https://crns-smartvision.github.io/lrwar
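To illustrate the kind of fusion the abstract describes, the snippet below is a minimal PyTorch sketch of cross-attention between a visual feature stream (e.g., per-frame embeddings from a mouth-region CNN) and a geometric feature stream computed from lip-contour landmarks. The layer sizes, module names, and the suggested downstream classifier are illustrative assumptions, not the authors' exact architecture.
```python
# Minimal sketch (not the authors' exact architecture): cross-attention fusion
# of per-frame visual features (e.g., from a mouth-region CNN) with geometric
# features derived from lip-contour landmarks. All dimensions are illustrative.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, vis_dim=512, geo_dim=128, d_model=256, n_heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # project visual stream
        self.geo_proj = nn.Linear(geo_dim, d_model)   # project geometric stream
        # Each stream attends to the other instead of simple concatenation.
        self.vis2geo = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.geo2vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_g = nn.LayerNorm(d_model)
        self.out = nn.Linear(2 * d_model, d_model)    # fused per-frame feature

    def forward(self, vis_feats, geo_feats):
        # vis_feats: (B, T, vis_dim), geo_feats: (B, T, geo_dim)
        v = self.vis_proj(vis_feats)
        g = self.geo_proj(geo_feats)
        # Queries come from one stream, keys/values from the other.
        v_att, _ = self.vis2geo(query=v, key=g, value=g)
        g_att, _ = self.geo2vis(query=g, key=v, value=v)
        v = self.norm_v(v + v_att)   # residual connection + layer norm
        g = self.norm_g(g + g_att)
        return self.out(torch.cat([v, g], dim=-1))    # (B, T, d_model)

# Usage sketch: fuse 29 frames of one clip, then feed a temporal classifier
# (e.g., a BiGRU or TCN head) to predict one of the 100 LRW-AR word classes.
fusion = CrossAttentionFusion()
vis = torch.randn(2, 29, 512)    # hypothetical per-frame CNN features
geo = torch.randn(2, 29, 128)    # hypothetical landmark-derived features
fused = fusion(vis, geo)         # -> torch.Size([2, 29, 256])
```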
Related papers
- Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection [4.997673761305336]
This paper proposes a context-aware attention framework for multimodal hateful content detection.
We evaluate the proposed approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English).
arXiv Detail & Related papers (2024-02-15T06:34:15Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Analysis of Visual Features for Continuous Lipreading in Spanish [0.0]
Lipreading is a complex task whose objective is to interpret speech when audio is not available.
We propose an analysis of different visual speech features with the intention of identifying which of them best captures the nature of lip movements for natural Spanish.
arXiv Detail & Related papers (2023-11-21T09:28:00Z)
- Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z)
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition [51.412413996510814]
We propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks.
MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2.
arXiv Detail & Related papers (2023-03-09T14:58:29Z)
- A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV Dataset [2.594602184695942]
This paper presents a new multipurpose audio-visual dataset for Persian.
It consists of almost 220 hours of video from 1760 speakers.
The dataset is suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition.
arXiv Detail & Related papers (2023-01-21T05:13:30Z)
- Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind English.
arXiv Detail & Related papers (2022-02-11T06:29:25Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Visual Grounding in Video for Unsupervised Word Translation [91.47607488740647]
We use visual grounding to improve unsupervised word mapping between languages.
We learn embeddings from unpaired instructional videos narrated in the native language.
We apply these methods to translate words from English to French, Korean, and Japanese.
arXiv Detail & Related papers (2020-03-11T02:03:37Z)