Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction
- URL: http://arxiv.org/abs/2507.18863v1
- Date: Fri, 25 Jul 2025 00:38:39 GMT
- Title: Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction
- Authors: Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh
- Abstract summary: Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. Existing methods often aim to predict words directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features.
- Score: 1.778037147204838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. The task is notably difficult because auditory cues are absent and many phonemes share similar visemes, i.e., distinct sounds that appear identical in lip motion. Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. To address these challenges, we propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM for word reconstruction. Stage 1 is the V-ASR module, which outputs predicted phonemes, thereby reducing training complexity; the facial-landmark features account for speaker-specific facial characteristics. Stage 2 is an encoder-decoder LLM, NLLB, that reconstructs the output phonemes into words. In addition to fine-tuning on a large visual dataset, our PV-ASR method demonstrates superior performance, achieving 17.4% WER on LRS2 and 21.0% WER on LRS3.
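As a rough illustration of Stage 1 described above, the following is a minimal PyTorch sketch, assuming per-frame visual features and 68-point facial-landmark features that are projected, concatenated, passed through a small Transformer encoder, and scored by a CTC head over a phoneme vocabulary. The feature dimensions, phoneme inventory size, and concatenation-based fusion are illustrative assumptions, not the architecture reported in the paper; Stage 2 (phoneme-to-word reconstruction with NLLB) is sketched after the related-papers list.

```python
# Minimal sketch (assumed architecture, not the authors' exact model):
# Stage 1 of a phoneme-level V-ASR system that fuses lip-region visual
# features with facial-landmark motion features and trains with CTC loss
# over a phoneme vocabulary.
import torch
import torch.nn as nn

NUM_PHONEMES = 40                     # assumed phoneme inventory (+1 CTC blank)
VISUAL_DIM, LANDMARK_DIM = 512, 136   # e.g. CNN features / 68 (x, y) landmarks
HIDDEN = 256

class Stage1PhonemePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual_proj = nn.Linear(VISUAL_DIM, HIDDEN)
        self.landmark_proj = nn.Linear(LANDMARK_DIM, HIDDEN)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # +1 output class for the CTC blank symbol
        self.ctc_head = nn.Linear(2 * HIDDEN, NUM_PHONEMES + 1)

    def forward(self, visual_feats, landmark_feats):
        # visual_feats:   (batch, frames, VISUAL_DIM)
        # landmark_feats: (batch, frames, LANDMARK_DIM)
        fused = torch.cat([self.visual_proj(visual_feats),
                           self.landmark_proj(landmark_feats)], dim=-1)
        fused = self.encoder(fused)
        return self.ctc_head(fused).log_softmax(dim=-1)   # (B, T, C)

# Toy forward/backward pass with random tensors.
model = Stage1PhonemePredictor()
B, T = 2, 75                                   # e.g. 3 s of video at 25 fps
log_probs = model(torch.randn(B, T, VISUAL_DIM),
                  torch.randn(B, T, LANDMARK_DIM))

ctc = nn.CTCLoss(blank=NUM_PHONEMES)           # last index is the blank
targets = torch.randint(0, NUM_PHONEMES, (B, 20))   # dummy phoneme labels
loss = ctc(log_probs.transpose(0, 1),          # CTCLoss expects (T, B, C)
           targets,
           input_lengths=torch.full((B,), T),
           target_lengths=torch.full((B,), 20))
loss.backward()
```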
Related papers
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- VALLR: Visual ASR Language Model for Lip Reading [28.561566996686484]
Lip Reading, or Visual Automatic Speech Recognition, is a complex task requiring the interpretation of spoken language exclusively from visual cues. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR). First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences (a sketch of this phoneme-to-word hand-off follows the list below).
arXiv Detail & Related papers (2025-03-27T11:52:08Z)
- FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability [10.184567639685321]
We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding. We present benchmarks to assess the model's ability to use images as substantive evidence. We identify attention heads with the strongest vision-language alignment, enabling explainability of visually driven hallucinations.
arXiv Detail & Related papers (2024-12-19T09:24:10Z)
- Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation.
We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26% WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition [92.6211155264297]
Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task.
However, recent vision models suffer from the problem that the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as the linguistic insensitive drift (LID) problem in this paper.
We propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of the vision model for accurate text recognition.
arXiv Detail & Related papers (2023-05-09T02:52:47Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
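Both the PV-ASR framework summarized above and VALLR hand a CTC phoneme hypothesis to a language model for word reconstruction. The following is a minimal sketch of that hand-off, under assumptions not stated in either abstract: greedy CTC decoding (collapsing repeats and dropping blanks) produces a phoneme string, which is then passed to a Hugging Face NLLB checkpoint that is assumed to have been fine-tuned on phoneme-to-word pairs beforehand. The phoneme inventory, checkpoint name, and decoding settings are illustrative only, and the fine-tuning step itself is not shown.

```python
# Sketch of the Stage 1 -> Stage 2 hand-off (assumed details):
# greedy CTC decoding of frame-level phoneme posteriors, then word
# reconstruction with an encoder-decoder checkpoint. The NLLB model is
# assumed to have been fine-tuned on phoneme-to-word pairs; an
# off-the-shelf checkpoint will not perform this mapping.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

PHONEMES = ["AA", "AE", "AH", "B", "CH", "D", "EH", "F", "IY", "K",
            "L", "M", "N", "OW", "P", "R", "S", "T", "UW", "V"]  # toy subset
BLANK = len(PHONEMES)  # CTC blank is the last class, as in the Stage 1 sketch

def ctc_greedy_decode(log_probs: torch.Tensor) -> list[str]:
    """Collapse repeated frame predictions and remove blanks."""
    ids = log_probs.argmax(dim=-1).tolist()        # best class per frame
    phones, prev = [], None
    for i in ids:
        if i != prev and i != BLANK:
            phones.append(PHONEMES[i])
        prev = i
    return phones

# Toy posteriors: (T, C) with C = len(PHONEMES) + 1 for the blank.
log_probs = torch.randn(75, len(PHONEMES) + 1).log_softmax(dim=-1)
phoneme_seq = " ".join(ctc_greedy_decode(log_probs))
print("Stage 1 hypothesis:", phoneme_seq)

# Stage 2: reconstruct words with a (hypothetically fine-tuned) NLLB model.
name = "facebook/nllb-200-distilled-600M"          # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
inputs = tokenizer(phoneme_seq, return_tensors="pt")
words = tokenizer.batch_decode(model.generate(**inputs, max_new_tokens=40),
                               skip_special_tokens=True)[0]
print("Stage 2 reconstruction:", words)
```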