Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
- URL: http://arxiv.org/abs/2511.07253v1
- Date: Mon, 10 Nov 2025 16:03:44 GMT
- Title: Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
- Authors: Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic
- Abstract summary: Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities. We present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that increase computational and deployment costs while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to train efficiently across multiple audio and visual granularities, reducing the paradigm's inherent training cost. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR matches or surpasses state-of-the-art baselines while training a single model at substantially lower training and deployment cost. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
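To make the multi-granularity idea concrete, the sketch below trains one shared model to transcribe under several audio and video token compression rates, averaging the per-granularity losses in the style of matryoshka representation learning. Everything here (the rates, module names, and the `llm(prefix, labels)` interface) is an illustrative assumption, not the released Omni-AVSR code.

```python
import torch
import torch.nn as nn

# Hypothetical audio/video token compression rates; the paper's actual
# granularity schedule is not reproduced here.
AUDIO_RATES = (1, 2, 4)
VIDEO_RATES = (1, 2, 4)

def compress(tokens: torch.Tensor, rate: int) -> torch.Tensor:
    """Average-pool consecutive tokens to 1/rate of the sequence length,
    a simple stand-in for fixed-rate token compression."""
    b, t, d = tokens.shape
    t = (t // rate) * rate                    # drop any remainder frames
    return tokens[:, :t].reshape(b, t // rate, rate, d).mean(dim=2)

class MatryoshkaAVSRSketch(nn.Module):
    """One shared backbone trained jointly across all granularity pairs."""

    def __init__(self, llm: nn.Module, d_audio: int, d_video: int, d_llm: int):
        super().__init__()
        self.llm = llm                        # LoRA-adapted, otherwise frozen
        self.audio_proj = nn.Linear(d_audio, d_llm)
        self.video_proj = nn.Linear(d_video, d_llm)

    def forward(self, audio, video, labels):
        # The same weights must solve the task at every compression pair,
        # which is what lets a single model support elastic inference.
        losses = []
        for ra in AUDIO_RATES:
            for rv in VIDEO_RATES:
                a = self.audio_proj(compress(audio, ra))
                v = self.video_proj(compress(video, rv))
                prefix = torch.cat([a, v], dim=1)
                # `self.llm` is assumed to return the cross-entropy loss of
                # generating `labels` conditioned on the multimodal prefix.
                losses.append(self.llm(prefix, labels))
        return torch.stack(losses).mean()
```

At inference time one would pick a single (audio, video) rate pair to trade accuracy for speed; a companion sketch of the shared versus scale-specific LoRA strategies appears after the related-papers list below.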
Related papers
- Fun-ASR Technical Report
We present Fun-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning. Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, and hotword customization, among other real-world application requirements. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
arXiv Detail & Related papers (2025-09-15T23:19:36Z)
- LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
LiSTEN is a framework for adapting large language models to audio-language tasks. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process.
arXiv Detail & Related papers (2025-05-24T05:28:22Z)
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including Audio-Visual Speech Recognition (AVSR), but typically rely on a fixed token compression rate. To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules (see the sketch after this list).
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Lyra is an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
arXiv Detail & Related papers (2024-12-12T17:50:39Z)
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Large Language Models are Strong Audio-Visual Speech Recognition Learners
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with WERs of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, namely mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that these adapters can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
We propose a reinforcement learning (RL)-based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions.
arXiv Detail & Related papers (2022-12-10T14:01:54Z)
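Both Omni-AVSR and Llama-MTSK above mention three LoRA-based strategies that balance shared and scale- or task-specific adaptation. The sketch below shows one plausible reading of that design space for a single linear layer; the class names, mode labels, and rank are assumptions for illustration, not either paper's implementation.

```python
import torch
import torch.nn as nn

class LoRAUpdate(nn.Module):
    """A trainable low-rank update of an otherwise frozen weight."""
    def __init__(self, d_in: int, d_out: int, r: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, r, bias=False)
        self.up = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.up.weight)        # update starts as a no-op

    def forward(self, x):
        return self.up(self.down(x))

class MultiScaleLoRALinear(nn.Module):
    """Frozen base layer plus LoRA updates, in one of three assumed modes:
    'global'   -- a single update shared by every scale/task,
    'specific' -- an independent update per scale/task,
    'hybrid'   -- a shared update summed with a per-scale update."""
    def __init__(self, base: nn.Linear, scales, mode: str = "hybrid", r: int = 8):
        super().__init__()
        self.base, self.mode = base, mode
        for p in base.parameters():
            p.requires_grad = False           # only the adapters train
        d_in, d_out = base.in_features, base.out_features
        self.shared = LoRAUpdate(d_in, d_out, r) if mode in ("global", "hybrid") else None
        self.per_scale = nn.ModuleDict(
            {s: LoRAUpdate(d_in, d_out, r) for s in scales}
        ) if mode in ("specific", "hybrid") else None

    def forward(self, x, scale: str):
        out = self.base(x)
        if self.shared is not None:
            out = out + self.shared(x)
        if self.per_scale is not None:
            out = out + self.per_scale[scale](x)
        return out

# Usage: route each forward pass through the adapter for its granularity.
layer = MultiScaleLoRALinear(nn.Linear(2048, 2048),
                             scales=["audio_x4", "audio_x2"], mode="hybrid")
y = layer(torch.randn(2, 16, 2048), scale="audio_x4")
```

The "hybrid" mode is one way to read the "global and scale-specific modules" phrasing in the Llama-MTSK summary: cross-task knowledge lives in the shared adapter while each granularity retains a small specialized one.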