Lightweight Operations for Visual Speech Recognition
- URL: http://arxiv.org/abs/2502.04834v1
- Date: Fri, 07 Feb 2025 11:08:32 GMT
- Title: Lightweight Operations for Visual Speech Recognition
- Authors: Iason Ioannis Panagos, Giorgos Sfikas, Christophoros Nikou
- Abstract summary: We develop lightweight visual speech recognition architectures for resource-constrained devices. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences.
- Score: 5.254384872541785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual speech recognition (VSR), which decodes spoken words from video data, offers significant benefits, particularly when audio is unavailable. However, the high dimensionality of video data leads to prohibitive computational costs that demand powerful hardware, limiting VSR deployment on resource-constrained devices. This work addresses this limitation by developing lightweight VSR architectures. Leveraging efficient operation design paradigms, we create compact yet powerful models with reduced resource requirements and minimal accuracy loss. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences, demonstrating their effectiveness for practical applications. We also conduct an extensive array of ablative experiments to thoroughly analyze the size and complexity of each model. Code and trained models will be made publicly available.
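The abstract does not name the specific operations used; as one plausible instance of the "efficient operation design paradigms" it refers to, here is a minimal PyTorch sketch of a depthwise separable convolution, a standard lightweight replacement for a full convolution (illustrative only, not necessarily the paper's exact block):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise + pointwise convolution: a common lightweight
    substitute for a full Conv2d (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A full 3x3 conv with 64->128 channels costs 64*128*9 weights (~73.7k);
# the separable version costs 64*9 + 64*128 (~8.8k), roughly an 8x reduction.
x = torch.randn(1, 64, 88, 88)                     # e.g., a lip-region feature map
print(DepthwiseSeparableConv2d(64, 128)(x).shape)  # torch.Size([1, 128, 88, 88])
```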
Related papers
- Designing Practical Models for Isolated Word Visual Speech Recognition [9.502316537342372]
Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interaction. We develop lightweight end-to-end architectures by first adapting efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone.
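As a sketch of what a "lightweight block design in a temporal convolution network backbone" could look like, the following PyTorch module uses a depthwise-separable 1D convolution with a residual connection; the channel count and kernel size are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LightTemporalBlock(nn.Module):
    """Depthwise-separable 1D convolution block: one plausible
    lightweight block for a temporal convolution backbone (sketch)."""
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv1d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time)
        return x + self.act(self.bn(self.pointwise(self.depthwise(x))))

feats = torch.randn(2, 256, 29)              # 29 frames of per-frame visual features
print(LightTemporalBlock(256)(feats).shape)  # torch.Size([2, 256, 29])
```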
arXiv Detail & Related papers (2025-08-25T11:04:36Z)
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z)
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
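A hedged sketch of the adaptive-retrieval idea: a lightweight intent classifier maps query complexity to a retrieval plan. All names here (`route_retrieval`, the labels, the plan fields) are illustrative, not AdaVideoRAG's actual API:

```python
# Hypothetical sketch of query-complexity routing: a lightweight
# classifier decides how much context to retrieve.
def route_retrieval(query: str, classify_intent) -> dict:
    """Map predicted query complexity to a retrieval plan."""
    label = classify_intent(query)  # e.g., "simple" | "moderate" | "complex"
    plans = {
        "simple":   {"sources": ["captions"],                  "top_k": 3},
        "moderate": {"sources": ["captions", "asr", "ocr"],    "top_k": 8},
        "complex":  {"sources": ["captions", "asr", "ocr",
                                 "visual", "semantic_graph"],  "top_k": 16},
    }
    return plans[label]

# Usage with a stub classifier:
plan = route_retrieval("Who appears after the explosion?", lambda q: "complex")
print(plan["sources"], plan["top_k"])
```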
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs [33.12165044958361]
Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including Audio-Visual Speech Recognition (AVSR).
Due to the significant length of speech representations, direct integration with LLMs imposes substantial computational costs.
We propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation.
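One way to read "flexible adaptation of the audio-visual token allocation" is Matryoshka-style nested granularities, where the same token sequence can be consumed at several compression rates. A minimal sketch under that assumption (average pooling; the paper's actual mechanism may differ):

```python
import torch

def matryoshka_pool(tokens: torch.Tensor, rate: int) -> torch.Tensor:
    """Average-pool a token sequence by an integer compression rate.
    Sketch of Matryoshka-style token allocation: the model is trained
    to accept several rates, and one is chosen at inference."""
    b, t, d = tokens.shape
    t_trim = (t // rate) * rate                 # drop a ragged tail, if any
    return tokens[:, :t_trim].reshape(b, t_trim // rate, rate, d).mean(dim=2)

av_tokens = torch.randn(1, 600, 1024)           # e.g., fused audio-visual tokens
for rate in (1, 2, 4, 8):                       # nested granularities
    print(rate, matryoshka_pool(av_tokens, rate).shape[1])  # 600, 300, 150, 75
```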
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
- TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler [10.92767902813594]
We introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs.
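Resamplers of this kind are typically built from a fixed set of learnable queries that cross-attend to the video tokens, so the output token count is constant regardless of video length. The sketch below shows that generic pattern with assumed dimensions, not TinyLLaVA-Video's exact module:

```python
import torch
import torch.nn as nn

class GroupResampler(nn.Module):
    """Cross-attention resampler: a fixed set of learnable queries
    absorbs an arbitrary number of video tokens (generic sketch)."""
    def __init__(self, dim=1024, num_queries=128, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens):  # (batch, n_tokens, dim)
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        out, _ = self.attn(q, video_tokens, video_tokens)
        return out                    # (batch, num_queries, dim), fixed size

tokens = torch.randn(1, 16 * 576, 1024)  # 16 frames x 576 patch tokens
print(GroupResampler()(tokens).shape)    # torch.Size([1, 128, 1024])
```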
arXiv Detail & Related papers (2025-01-26T13:10:12Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture like SOLO effectively addresses scalability concerns in LVLMs. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training methods, such as quantization schemes and time-domain vs. spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit rate.
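Neural codecs such as EnCodec typically produce their discrete codes with residual vector quantization (RVQ), where each stage quantizes the residual left by the previous one. A minimal sketch of that scheme (toy codebooks, not the paper's pipeline):

```python
import torch

def residual_vq(x, codebooks):
    """Residual vector quantization (as in neural codecs like EnCodec):
    each stage quantizes the residual of the previous one. Sketch only."""
    indices, residual = [], x
    for cb in codebooks:                          # cb: (codes, dim)
        idx = torch.cdist(residual, cb).argmin(dim=1)
        residual = residual - cb[idx]             # pass on what is left
        indices.append(idx)
    return torch.stack(indices, dim=1)            # (frames, n_stages) codes

frames = torch.randn(100, 128)                           # encoded speech frames
codebooks = [torch.randn(1024, 128) for _ in range(4)]   # 4 RVQ stages
print(residual_vq(frames, codebooks).shape)              # torch.Size([100, 4])
```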
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
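A frame reduction layer can be as simple as concatenating groups of adjacent encoder frames. The sketch below shows that pattern, with the factor chosen so that, at an assumed 40 ms input frame rate, one output frame covers 2.56 s as in the abstract (the paper's actual layer may differ, e.g., by adding a projection):

```python
import torch

def reduce_frames(enc_out: torch.Tensor, factor: int) -> torch.Tensor:
    """Concatenate groups of adjacent encoder frames into single frames,
    cutting the output frame rate by `factor` (minimal sketch)."""
    b, t, d = enc_out.shape
    t_trim = (t // factor) * factor
    return enc_out[:, :t_trim].reshape(b, t_trim // factor, d * factor)

enc = torch.randn(1, 256, 512)   # 256 frames at 40 ms each (~10.24 s)
out = reduce_frames(enc, 64)     # one output frame per 64 * 40 ms = 2.56 s
print(out.shape)                 # torch.Size([1, 4, 32768])
```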
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy: pre-training with visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
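Speech units are commonly obtained by clustering per-frame features with k-means and mapping each frame to its nearest centroid, often merging consecutive repeats. A sketch of that recipe (the deduplication step and all dimensions are assumptions; the paper's exact procedure may differ):

```python
import torch

def to_speech_units(feats, centroids, dedup=True):
    """Discretize per-frame visual features into unit ids via nearest
    k-means centroid, optionally merging consecutive repeats (sketch)."""
    units = torch.cdist(feats, centroids).argmin(dim=1)   # (frames,)
    if dedup:                                             # e.g., 5 5 5 2 2 -> 5 2
        keep = torch.ones_like(units, dtype=torch.bool)
        keep[1:] = units[1:] != units[:-1]
        units = units[keep]
    return units

feats = torch.randn(75, 768)        # 75 video frames of visual features
centroids = torch.randn(200, 768)   # k-means codebook fitted offline
print(to_speech_units(feats, centroids).shape)
```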
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
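Conceptually, the ICAE appends learnable memory-slot tokens to the long context and keeps only the encoder states at those positions as the compressed representation. A toy-scale sketch of that idea (dimensions and layer counts are arbitrary, not the paper's):

```python
import torch
import torch.nn as nn

# Memory-slot tokens are appended to the embedded long context; the
# encoder outputs at those positions become the short compressed memory.
dim, ctx_len, n_slots = 512, 2048, 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
context = torch.randn(1, ctx_len, dim)                # embedded long context
memory_slots = nn.Parameter(torch.randn(1, n_slots, dim) * 0.02)

hidden = encoder(torch.cat([context, memory_slots], dim=1))
compressed = hidden[:, -n_slots:]                     # (1, 128, 512): 16x shorter
print(compressed.shape)
```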
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- A Light Weight Model for Active Speaker Detection [7.253335671577093]
We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent unit (GRU) with low computational complexity for cross-modal modeling.
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset showing good robustness.
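The abstract's recipe (2D convolutions for audio, split 2D/3D convolutions for video, a GRU for cross-modal modeling) can be skeletonized as below; channel sizes, pooling, and input shapes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LightASD(nn.Module):
    """Skeleton of the abstract's recipe: 2D convs on audio spectrograms,
    factorized (2+1)D convs on video, a GRU for cross-modal modeling."""
    def __init__(self, dim=64):
        super().__init__()
        self.audio = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1))
        # (2+1)D split: spatial conv then temporal conv, cheaper than full 3D
        self.video_spatial = nn.Conv3d(1, dim, (1, 3, 3), padding=(0, 1, 1))
        self.video_temporal = nn.Conv3d(dim, dim, (3, 1, 1), padding=(1, 0, 0))
        self.gru = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)   # speaking / not-speaking logit

    def forward(self, audio, video):    # audio: (B,1,T,F), video: (B,1,T,H,W)
        a = self.audio(audio).flatten(1)                   # (B, dim) clip-level
        v = torch.relu(self.video_temporal(torch.relu(self.video_spatial(video))))
        v = v.mean(dim=(3, 4)).transpose(1, 2)             # (B, T, dim) per-frame
        a = a.unsqueeze(1).expand(-1, v.size(1), -1)       # broadcast over time
        h, _ = self.gru(torch.cat([v, a], dim=2))
        return self.head(h).squeeze(-1)                    # (B, T) per-frame scores

scores = LightASD()(torch.randn(2, 1, 25, 40), torch.randn(2, 1, 25, 96, 96))
print(scores.shape)  # torch.Size([2, 25])
```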
arXiv Detail & Related papers (2023-03-08T08:40:56Z)
- Knowledge Transfer For On-Device Speech Emotion Recognition with Neural Structured Learning [19.220263739291685]
Speech emotion recognition (SER) has been a popular research topic in human-computer interaction (HCI).
We propose a neural structured learning (NSL) framework that builds synthesized graphs.
Our experiments demonstrate that training a lightweight SER model on the target dataset with speech samples and graphs not only produces small SER models but also enhances model performance.
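In its generic form, a neural structured learning objective adds a graph-neighbor regularizer to the supervised loss; a sketch of that objective (the weighting and distance choices are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def nsl_loss(logits, labels, embeddings, neighbor_embeddings, alpha=0.1):
    """Generic neural structured learning objective: supervised loss plus
    a penalty pulling each embedding toward its graph neighbors' (sketch)."""
    supervised = F.cross_entropy(logits, labels)
    structure = F.mse_loss(embeddings, neighbor_embeddings)
    return supervised + alpha * structure

logits = torch.randn(8, 4)            # 8 utterances, 4 emotion classes
labels = torch.randint(0, 4, (8,))
emb = torch.randn(8, 64)              # model embeddings
nbr = torch.randn(8, 64)              # embeddings of graph neighbors
print(nsl_loss(logits, labels, emb, nbr).item())
```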
arXiv Detail & Related papers (2022-10-26T18:38:42Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- Weakly Supervised Construction of ASR Systems with Massive Video Data [18.5050375783871]
We present a weakly supervised framework for constructing ASR systems with massive video data.
We propose an effective approach to extract high-quality audio aligned with transcripts from videos based on Optical Character Recognition (OCR).
Our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
arXiv Detail & Related papers (2020-08-04T03:11:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.