Designing Practical Models for Isolated Word Visual Speech Recognition
- URL: http://arxiv.org/abs/2508.17894v1
- Date: Mon, 25 Aug 2025 11:04:36 GMT
- Title: Designing Practical Models for Isolated Word Visual Speech Recognition
- Authors: Iason Ioannis Panagos, Giorgos Sfikas, Christophoros Nikou
- Abstract summary: Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interactions. We develop lightweight end-to-end architectures by first benchmarking efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone.
- Score: 9.502316537342372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interactions. A VSR system is typically employed in a complementary role in cases where the audio is corrupt or not available. To accurately predict the spoken words, these architectures often rely on deep neural networks to extract meaningful representations from the input sequence. While deep architectures achieve impressive recognition performance, relying on such models incurs significant computation costs, which translate into increased hardware requirements and limit applicability in real-world scenarios where resources might be constrained. This factor prevents wider adoption and deployment of speech recognition systems in more practical applications. In this work, we aim to alleviate this issue by developing architectures for VSR that have low hardware costs. Following the standard two-network design paradigm, where one network handles visual feature extraction and another utilizes the extracted features to classify the entire sequence, we develop lightweight end-to-end architectures by first benchmarking efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone. We create several unified models with low resource requirements but strong recognition performance. Experiments on the largest public database for English words demonstrate the effectiveness and practicality of our developed models. Code and trained models will be made publicly available.
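The two-network paradigm described in the abstract (a visual frontend that extracts per-frame features, followed by a temporal model that classifies the whole sequence) can be sketched in miniature. This is a hedged NumPy illustration only, not the paper's models: the linear "frontend", the depthwise temporal convolution, and all shapes and sizes are placeholder assumptions standing in for an efficient image CNN and a TCN backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_features(frames, w_in):
    # Per-frame "visual frontend": flatten each frame and project it to a
    # feature vector (a stand-in for an efficient image-classification CNN).
    flat = frames.reshape(frames.shape[0], -1)   # (T, H*W)
    return np.maximum(flat @ w_in, 0.0)          # ReLU, (T, D)

def temporal_conv(x, kernels):
    # Depthwise 1-D convolution over time: one small kernel per channel,
    # the basic building block of a temporal convolution network (TCN).
    T, D = x.shape
    k = kernels.shape[1]
    pad = np.vstack([np.zeros((k - 1, D)), x])   # causal left-padding
    out = np.empty_like(x)
    for t in range(T):
        window = pad[t:t + k]                    # (k, D) slice of time
        out[t] = np.sum(window * kernels.T, axis=0)
    return np.maximum(out, 0.0)

def classify(frames, w_in, kernels, w_out):
    # End-to-end: frontend -> temporal backbone -> average over time -> word scores.
    feats = temporal_conv(frame_features(frames, w_in), kernels)
    logits = feats.mean(axis=0) @ w_out
    return int(np.argmax(logits))

# Illustrative shapes: 29 frames of 16x16 pixels, 32 channels, 10 candidate words.
T, H, W, D, K, n_words = 29, 16, 16, 32, 3, 10
frames = rng.standard_normal((T, H, W))
w_in = rng.standard_normal((H * W, D)) * 0.05
kernels = rng.standard_normal((D, K)) * 0.3
w_out = rng.standard_normal((D, n_words)) * 0.1
print(classify(frames, w_in, kernels, w_out))
```

In the paper's setting the frontend would be a benchmarked efficient CNN and the temporal stage a full TCN with lightweight blocks; the sketch only shows how the two stages compose.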
Related papers
- LaVPR: Benchmarking Language and Vision for Place Recognition [16.711881128691605]
We introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Our results show that language descriptions yield consistent gains in visually degraded conditions. LaVPR enables a new class of localization systems that are both resilient to real-world conditions and practical for resource-constrained deployment.
arXiv Detail & Related papers (2026-02-03T08:38:38Z) - Keep what you need : extracting efficient subnetworks from large audio representation models [0.8798470556253869]
We introduce learnable binary masks in-between the layers of a pretrained representation model. When training the end-to-end model on a downstream task, we add a sparsity-inducing loss to the overall objective. Once trained, the masked computational units can then be removed from the network, yielding significant performance gains.
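The masking idea summarized above (gates between layers plus a sparsity loss, then removal of closed units) can be illustrated with a small sketch. The sigmoid relaxation, the 0.5 threshold, and the penalty weight below are assumptions for illustration, not that paper's exact formulation.

```python
import numpy as np

def gate(logits):
    # Relaxed binary mask: sigmoid gates that multiply layer activations.
    return 1.0 / (1.0 + np.exp(-logits))

def sparsity_loss(logits, lam=0.1):
    # L1-style penalty on the open gates, added to the downstream-task
    # objective to push as many gates as possible toward 0.
    return lam * np.sum(gate(logits))

def prune(weights, logits, thresh=0.5):
    # After training, drop the columns (units) whose gate stayed closed.
    keep = gate(logits) >= thresh
    return weights[:, keep], keep

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))
logits = np.array([4.0, -4.0, 3.0, -5.0, 2.0, -3.0])  # 3 open, 3 closed
pruned, keep = prune(w, logits)
print(pruned.shape, keep.sum())  # half the units removed
```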
arXiv Detail & Related papers (2025-02-18T15:04:33Z) - Lightweight Operations for Visual Speech Recognition [5.254384872541785]
We develop lightweight visual speech recognition architectures for resource-constrained devices. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences.
arXiv Detail & Related papers (2025-02-07T11:08:32Z) - SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z) - Computer Vision Model Compression Techniques for Embedded Systems: A Survey [75.38606213726906]
This paper covers the main model compression techniques applied for computer vision tasks.
We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique.
We also share codes to assist researchers and new practitioners in overcoming initial implementation challenges.
arXiv Detail & Related papers (2024-08-15T16:41:55Z) - Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems. More precisely, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder. Our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish.
arXiv Detail & Related papers (2024-07-09T07:15:56Z) - SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture, like SOLO, effectively addresses scalability concerns in large vision-language models (LVLMs). In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z) - Conformer LLMs -- Convolution Augmented Large Language Models [2.8935588665357077]
This work brings together two popular blocks of neural architecture, namely convolutional layers and Transformers, for large language models (LLMs).
Transformer decoders effectively capture long-range dependencies over several modalities and form a core backbone of modern advancements in machine learning.
This work showcases a robust speech architecture that can be integrated and adapted in a causal setup beyond speech applications for large-scale language modeling.
arXiv Detail & Related papers (2023-07-02T03:05:41Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
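The quadratic-form idea above can be made concrete: for a feature vector x, a second-order Taylor-style expansion collects the constant, linear, and pairwise product terms. The layer in that paper is more elaborate; this sketch only shows the underlying expansion it borrows.

```python
import numpy as np

def quadratic_features(x):
    # Second-order (quadratic-form) expansion of a feature vector:
    # [1, x_i, x_i * x_j for i <= j] -- the terms of a truncated Taylor series.
    n = x.shape[0]
    cross = np.outer(x, x)[np.triu_indices(n)]  # x_i * x_j with i <= j
    return np.concatenate(([1.0], x, cross))

x = np.array([1.0, 2.0, 3.0])
print(quadratic_features(x))  # [1. 1. 2. 3. 1. 2. 3. 4. 6. 9.]
```

For an n-dimensional input this yields 1 + n + n(n+1)/2 features, which is why such layers buy representational richness at a quadratic parameter cost.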
arXiv Detail & Related papers (2020-11-23T14:40:18Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose AutoSpeech, the first neural architecture search approach for speaker recognition tasks.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
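Time-restricted self-attention, as used for the streaming encoder above, simply masks the attention scores outside a fixed window around each frame, which bounds the context (and hence latency) each output depends on. A minimal NumPy sketch; the window sizes and the use of the input as its own queries/keys/values are simplifying assumptions.

```python
import numpy as np

def time_restricted_attention(x, left=2, right=1):
    # Self-attention whose mask limits each frame to a fixed context
    # window [t - left, t + right] instead of the full sequence.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                  # (T, T) dot-product scores
    idx = np.arange(T)
    mask = (idx[None, :] >= idx[:, None] - left) & \
           (idx[None, :] <= idx[:, None] + right)
    scores = np.where(mask, scores, -np.inf)       # forbid out-of-window frames
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-wise softmax
    return w @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
out = time_restricted_attention(x)
print(out.shape)  # (6, 4)
```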
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.