MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2510.04136v1
- Date: Sun, 05 Oct 2025 10:34:34 GMT
- Title: MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
- Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic
- Abstract summary: Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities. MoME is a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based large language models for speech recognition. MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters.
- Score: 39.90876258237132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
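The abstract describes MoME's core mechanism: a frozen LLM augmented with a small number of top-k routed experts plus an always-active shared expert, with a single router reused across Matryoshka token granularities so that heavily compressed sequences activate the same experts as uncompressed ones. A minimal NumPy sketch of such a layer follows; all names, dimensions, and the ReLU feed-forward experts are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoMELayerSketch:
    """Hypothetical Mixture-of-Matryoshka-Experts layer.

    One shared expert is always active; the router picks the top-k of the
    routed experts per token. The same router weights serve every
    Matryoshka granularity (token-compression scale), so compressed
    sequences reuse expert assignments learned at lower compression.
    """

    def __init__(self, d_model, d_ff, n_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.router = rng.standard_normal((d_model, n_experts)) * 0.02
        # each expert: a two-layer ReLU feed-forward network (W_in, W_out)
        self.experts = [
            (rng.standard_normal((d_model, d_ff)) * 0.02,
             rng.standard_normal((d_ff, d_model)) * 0.02)
            for _ in range(n_experts)
        ]
        self.shared = (rng.standard_normal((d_model, d_ff)) * 0.02,
                       rng.standard_normal((d_ff, d_model)) * 0.02)

    @staticmethod
    def _ffn(x, weights):
        w_in, w_out = weights
        return np.maximum(x @ w_in, 0.0) @ w_out

    def __call__(self, tokens):
        # tokens: (seq_len, d_model), at any Matryoshka granularity
        logits = tokens @ self.router                     # (seq, n_experts)
        top = np.argsort(logits, axis=-1)[:, -self.top_k:]
        gates = softmax(np.take_along_axis(logits, top, axis=-1), axis=-1)
        out = self._ffn(tokens, self.shared)              # shared expert, always on
        for i, tok in enumerate(tokens):                  # sparse top-k routing
            for j, e in enumerate(top[i]):
                out[i] += gates[i, j] * self._ffn(tok[None, :], self.experts[e])[0]
        return out

# the same layer (and router) serves sequences at different compression rates
layer = MoMELayerSketch(d_model=16, d_ff=32)
full = layer(np.full((40, 16), 0.1))   # low compression: 40 tokens
short = layer(np.full((10, 16), 0.1))  # high compression: 10 tokens
```

Because the router is shared, the gating decisions for the 10-token sequence come from the same learned weights as for the 40-token one, which is the property the paper credits for cross-scale consistency.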
Related papers
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models [34.15708407614003]
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities. We present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-10T16:03:44Z)
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
- Fun-ASR Technical Report [89.84148151617022]
We present Fun-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning. Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Thanks to these production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
arXiv Detail & Related papers (2025-09-15T23:19:36Z)
- Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition [54.44798086835314]
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. We propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks.
arXiv Detail & Related papers (2025-09-10T10:18:56Z)
- Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach [37.690797152736465]
Llama-SMoP employs a Sparse Mixture of Projectors (SMoP) module to scale model capacity without increasing inference costs. It achieves superior performance on ASR, VSR, and AVSR tasks.
arXiv Detail & Related papers (2025-05-20T13:20:55Z)
- Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration [34.43633070396096]
State-Space Models (SSMs) have attracted considerable attention in Image Restoration (IR). Q-MambaIR is an accurate, efficient, and flexible Quantized Mamba for IR tasks.
arXiv Detail & Related papers (2025-03-27T20:34:11Z)
- ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
The Mixture-of-Experts (MoE) Transformer, the backbone of several prominent language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an MoE approximation framework that uses the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs [33.12165044958361]
Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including Audio-Visual Speech Recognition (AVSR). To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules.
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
- TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs [3.808154352665581]
We propose a novel framework that performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets. We show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
arXiv Detail & Related papers (2025-01-26T21:05:16Z)
- Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with minimal accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.