Lightweight Operations for Visual Speech Recognition
- URL: http://arxiv.org/abs/2502.04834v1
- Date: Fri, 07 Feb 2025 11:08:32 GMT
- Title: Lightweight Operations for Visual Speech Recognition
- Authors: Iason Ioannis Panagos, Giorgos Sfikas, Christophoros Nikou
- Abstract summary: We develop lightweight visual speech recognition architectures for resource-constrained devices. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences.
- Score: 5.254384872541785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual speech recognition (VSR), which decodes spoken words from video data, offers significant benefits, particularly when audio is unavailable. However, the high dimensionality of video data leads to prohibitive computational costs that demand powerful hardware, limiting VSR deployment on resource-constrained devices. This work addresses this limitation by developing lightweight VSR architectures. Leveraging efficient operation design paradigms, we create compact yet powerful models with reduced resource requirements and minimal accuracy loss. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences, demonstrating their effectiveness for practical applications. We also conduct an extensive array of ablative experiments to thoroughly analyze the size and complexity of each model. Code and trained models will be made publicly available.
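The abstract does not name the specific operations used; as one plausible instance of the "efficient operation design paradigms" it refers to, here is a minimal PyTorch sketch of a depthwise separable convolution, a standard lightweight replacement for a full convolution (illustrative only, not necessarily the paper's exact block):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise + pointwise convolution: a common lightweight
    substitute for a full Conv2d (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A full 3x3 conv with 64->128 channels costs 64*128*9 weights (~73.7k);
# the separable version costs 64*9 + 64*128 (~8.8k), roughly an 8x reduction.
x = torch.randn(1, 64, 88, 88)                     # e.g., a lip-region feature map
print(DepthwiseSeparableConv2d(64, 128)(x).shape)  # torch.Size([1, 128, 88, 88])
```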
Related papers
- Designing Practical Models for Isolated Word Visual Speech Recognition [9.502316537342372]
Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interaction. We develop lightweight end-to-end architectures by first adapting efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone.
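As a sketch of what a "lightweight block design in a temporal convolution network backbone" could look like, the following PyTorch module uses a depthwise-separable 1D convolution with a residual connection; the channel count and kernel size are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LightTemporalBlock(nn.Module):
    """Depthwise-separable 1D convolution block: one plausible
    lightweight block for a temporal convolution backbone (sketch)."""
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv1d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time)
        return x + self.act(self.bn(self.pointwise(self.depthwise(x))))

feats = torch.randn(2, 256, 29)              # 29 frames of per-frame visual features
print(LightTemporalBlock(256)(feats).shape)  # torch.Size([2, 256, 29])
```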
arXiv Detail & Related papers (2025-08-25T11:04:36Z)
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z)
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
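A hedged sketch of the adaptive-retrieval idea: a lightweight intent classifier maps query complexity to a retrieval plan. All names here (`route_retrieval`, the labels, the plan fields) are illustrative, not AdaVideoRAG's actual API:

```python
# Hypothetical sketch of query-complexity routing: a lightweight
# classifier decides how much context to retrieve.
def route_retrieval(query: str, classify_intent) -> dict:
    """Map predicted query complexity to a retrieval plan."""
    label = classify_intent(query)  # e.g., "simple" | "moderate" | "complex"
    plans = {
        "simple":   {"sources": ["captions"],                  "top_k": 3},
        "moderate": {"sources": ["captions", "asr", "ocr"],    "top_k": 8},
        "complex":  {"sources": ["captions", "asr", "ocr",
                                 "visual", "semantic_graph"],  "top_k": 16},
    }
    return plans[label]

# Usage with a stub classifier:
plan = route_retrieval("Who appears after the explosion?", lambda q: "complex")
print(plan["sources"], plan["top_k"])
```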
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs [33.12165044958361]
Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including Audio-Visual Speech Recognition (AVSR).
Due to the significant length of speech representations, direct integration with LLMs imposes substantial computational costs.
We propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation.
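One way to read "flexible adaptation of the audio-visual token allocation" is Matryoshka-style nested granularities, where the same token sequence can be consumed at several compression rates. A minimal sketch under that assumption (average pooling; the paper's actual mechanism may differ):

```python
import torch

def matryoshka_pool(tokens: torch.Tensor, rate: int) -> torch.Tensor:
    """Average-pool a token sequence by an integer compression rate.
    Sketch of Matryoshka-style token allocation: the model is trained
    to accept several rates, and one is chosen at inference."""
    b, t, d = tokens.shape
    t_trim = (t // rate) * rate                 # drop a ragged tail, if any
    return tokens[:, :t_trim].reshape(b, t_trim // rate, rate, d).mean(dim=2)

av_tokens = torch.randn(1, 600, 1024)           # e.g., fused audio-visual tokens
for rate in (1, 2, 4, 8):                       # nested granularities
    print(rate, matryoshka_pool(av_tokens, rate).shape[1])  # 600, 300, 150, 75
```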
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
- TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler [10.92767902813594]
We introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs.
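Resamplers of this kind are typically built from a fixed set of learnable queries that cross-attend to the video tokens, so the output token count is constant regardless of video length. The sketch below shows that generic pattern with assumed dimensions, not TinyLLaVA-Video's exact module:

```python
import torch
import torch.nn as nn

class GroupResampler(nn.Module):
    """Cross-attention resampler: a fixed set of learnable queries
    absorbs an arbitrary number of video tokens (generic sketch)."""
    def __init__(self, dim=1024, num_queries=128, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens):  # (batch, n_tokens, dim)
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        out, _ = self.attn(q, video_tokens, video_tokens)
        return out                    # (batch, num_queries, dim), fixed size

tokens = torch.randn(1, 16 * 576, 1024)  # 16 frames x 576 patch tokens
print(GroupResampler()(tokens).shape)    # torch.Size([1, 128, 1024])
```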
arXiv Detail & Related papers (2025-01-26T13:10:12Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture like SOLO effectively addresses scalability concerns in LVLMs. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training methods, such as quantization schemes and time-domain vs. spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit rate.
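Neural codecs such as EnCodec typically produce their discrete codes with residual vector quantization (RVQ), where each stage quantizes the residual left by the previous one. A minimal sketch of that scheme (toy codebooks, not the paper's pipeline):

```python
import torch

def residual_vq(x, codebooks):
    """Residual vector quantization (as in neural codecs like EnCodec):
    each stage quantizes the residual of the previous one. Sketch only."""
    indices, residual = [], x
    for cb in codebooks:                          # cb: (codes, dim)
        idx = torch.cdist(residual, cb).argmin(dim=1)
        residual = residual - cb[idx]             # pass on what is left
        indices.append(idx)
    return torch.stack(indices, dim=1)            # (frames, n_stages) codes

frames = torch.randn(100, 128)                           # encoded speech frames
codebooks = [torch.randn(1024, 128) for _ in range(4)]   # 4 RVQ stages
print(residual_vq(frames, codebooks).shape)              # torch.Size([100, 4])
```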
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
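A frame reduction layer can be as simple as concatenating groups of adjacent encoder frames. The sketch below shows that pattern, with the factor chosen so that, at an assumed 40 ms input frame rate, one output frame covers 2.56 s as in the abstract (the paper's actual layer may differ, e.g., by adding a projection):

```python
import torch

def reduce_frames(enc_out: torch.Tensor, factor: int) -> torch.Tensor:
    """Concatenate groups of adjacent encoder frames into single frames,
    cutting the output frame rate by `factor` (minimal sketch)."""
    b, t, d = enc_out.shape
    t_trim = (t // factor) * factor
    return enc_out[:, :t_trim].reshape(b, t_trim // factor, d * factor)

enc = torch.randn(1, 256, 512)   # 256 frames at 40 ms each (~10.24 s)
out = reduce_frames(enc, 64)     # one output frame per 64 * 40 ms = 2.56 s
print(out.shape)                 # torch.Size([1, 4, 32768])
```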
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy: pre-training with visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
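Speech units are commonly obtained by clustering per-frame features with k-means and mapping each frame to its nearest centroid, often merging consecutive repeats. A sketch of that recipe (the deduplication step and all dimensions are assumptions; the paper's exact procedure may differ):

```python
import torch

def to_speech_units(feats, centroids, dedup=True):
    """Discretize per-frame visual features into unit ids via nearest
    k-means centroid, optionally merging consecutive repeats (sketch)."""
    units = torch.cdist(feats, centroids).argmin(dim=1)   # (frames,)
    if dedup:                                             # e.g., 5 5 5 2 2 -> 5 2
        keep = torch.ones_like(units, dtype=torch.bool)
        keep[1:] = units[1:] != units[:-1]
        units = units[keep]
    return units

feats = torch.randn(75, 768)        # 75 video frames of visual features
centroids = torch.randn(200, 768)   # k-means codebook fitted offline
print(to_speech_units(feats, centroids).shape)
```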
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
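Conceptually, the ICAE appends learnable memory-slot tokens to the long context and keeps only the encoder states at those positions as the compressed representation. A toy-scale sketch of that idea (dimensions and layer counts are arbitrary, not the paper's):

```python
import torch
import torch.nn as nn

# Memory-slot tokens are appended to the embedded long context; the
# encoder outputs at those positions become the short compressed memory.
dim, ctx_len, n_slots = 512, 2048, 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
context = torch.randn(1, ctx_len, dim)                # embedded long context
memory_slots = nn.Parameter(torch.randn(1, n_slots, dim) * 0.02)

hidden = encoder(torch.cat([context, memory_slots], dim=1))
compressed = hidden[:, -n_slots:]                     # (1, 128, 512): 16x shorter
print(compressed.shape)
```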
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- A Light Weight Model for Active Speaker Detection [7.253335671577093]
We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent unit (GRU) with low computational complexity for cross-modal modeling.
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset showing good robustness.
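The abstract's recipe (2D convolutions for audio, split 2D/3D convolutions for video, a GRU for cross-modal modeling) can be skeletonized as below; channel sizes, pooling, and input shapes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LightASD(nn.Module):
    """Skeleton of the abstract's recipe: 2D convs on audio spectrograms,
    factorized (2+1)D convs on video, a GRU for cross-modal modeling."""
    def __init__(self, dim=64):
        super().__init__()
        self.audio = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1))
        # (2+1)D split: spatial conv then temporal conv, cheaper than full 3D
        self.video_spatial = nn.Conv3d(1, dim, (1, 3, 3), padding=(0, 1, 1))
        self.video_temporal = nn.Conv3d(dim, dim, (3, 1, 1), padding=(1, 0, 0))
        self.gru = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)   # speaking / not-speaking logit

    def forward(self, audio, video):    # audio: (B,1,T,F), video: (B,1,T,H,W)
        a = self.audio(audio).flatten(1)                   # (B, dim) clip-level
        v = torch.relu(self.video_temporal(torch.relu(self.video_spatial(video))))
        v = v.mean(dim=(3, 4)).transpose(1, 2)             # (B, T, dim) per-frame
        a = a.unsqueeze(1).expand(-1, v.size(1), -1)       # broadcast over time
        h, _ = self.gru(torch.cat([v, a], dim=2))
        return self.head(h).squeeze(-1)                    # (B, T) per-frame scores

scores = LightASD()(torch.randn(2, 1, 25, 40), torch.randn(2, 1, 25, 96, 96))
print(scores.shape)  # torch.Size([2, 25])
```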
arXiv Detail & Related papers (2023-03-08T08:40:56Z)
- Knowledge Transfer For On-Device Speech Emotion Recognition with Neural Structured Learning [19.220263739291685]
Speech emotion recognition (SER) has been a popular research topic in human-computer interaction (HCI).
We propose a neural structured learning (NSL) framework that builds synthesized graphs.
Our experiments demonstrate that training a lightweight SER model on the target dataset with speech samples and graphs not only produces small SER models but also enhances model performance.
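In its generic form, a neural structured learning objective adds a graph-neighbor regularizer to the supervised loss; a sketch of that objective (the weighting and distance choices are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def nsl_loss(logits, labels, embeddings, neighbor_embeddings, alpha=0.1):
    """Generic neural structured learning objective: supervised loss plus
    a penalty pulling each embedding toward its graph neighbors' (sketch)."""
    supervised = F.cross_entropy(logits, labels)
    structure = F.mse_loss(embeddings, neighbor_embeddings)
    return supervised + alpha * structure

logits = torch.randn(8, 4)            # 8 utterances, 4 emotion classes
labels = torch.randint(0, 4, (8,))
emb = torch.randn(8, 64)              # model embeddings
nbr = torch.randn(8, 64)              # embeddings of graph neighbors
print(nsl_loss(logits, labels, emb, nbr).item())
```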
arXiv Detail & Related papers (2022-10-26T18:38:42Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- Weakly Supervised Construction of ASR Systems with Massive Video Data [18.5050375783871]
We present a weakly supervised framework for constructing ASR systems with massive video data.
We propose an effective approach to extract high-quality audio aligned with transcripts from videos based on Optical Character Recognition (OCR).
Our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
arXiv Detail & Related papers (2020-08-04T03:11:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.