MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free
- URL: http://arxiv.org/abs/2601.02967v2
- Date: Thu, 08 Jan 2026 06:17:18 GMT
- Title: MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free
- Authors: Yishu Lei, Shuwei He, Jing Hu, Dan Zhang, Xianlong Luo, Danxiang Zhu, Shikun Feng, Rui Liu, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang
- Abstract summary: We introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks.
- Score: 27.346096262060787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing research typically relies on a single dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization, as the parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines at comparable computational cost. Furthermore, we will release the related code and models to facilitate future research.
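To make the routing scheme described in the abstract concrete, below is a minimal PyTorch sketch of a sparse MoE adapter that routes each audio token to its top-k specialized experts while shared experts process every token. This is an illustrative reading of the abstract, not the authors' released code: the class name MoEAdapter, the hidden size, the expert count, the top-k value, and the two-layer FFN expert design are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEAdapter(nn.Module):
    """Hypothetical sketch: a gating network sends each audio token to its
    top-k specialized experts; shared experts keep a dense path for global
    context. Sizes and expert design are assumptions, not the paper's."""

    def __init__(self, dim: int = 1024, num_experts: int = 8,
                 num_shared: int = 1, top_k: int = 2):
        super().__init__()
        self.top_k = top_k

        def ffn():  # lightweight feed-forward expert (assumed design)
            return nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(),
                                 nn.Linear(2 * dim, dim))

        self.experts = nn.ModuleList(ffn() for _ in range(num_experts))
        self.shared = nn.ModuleList(ffn() for _ in range(num_shared))
        self.gate = nn.Linear(dim, num_experts)  # dynamic gating network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) audio tokens from the audio encoder
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)  # sparse routing
        weights = F.softmax(weights, dim=-1)                  # gate weights
        out = sum(e(x) for e in self.shared)                  # dense shared path
        # Naively evaluate all experts for readability; a real MoE kernel
        # would dispatch only the tokens routed to each expert.
        all_out = torch.stack([e(x) for e in self.experts], dim=2)  # (B,T,E,D)
        picked = torch.gather(
            all_out, 2,
            idx.unsqueeze(-1).expand(*idx.shape, all_out.size(-1)))  # (B,T,k,D)
        return out + (weights.unsqueeze(-1) * picked).sum(dim=2)


# Usage: y = MoEAdapter()(torch.randn(2, 16, 1024))  # -> (2, 16, 1024)
```

Because only k of the experts contribute per token, the forward cost of a dispatched implementation stays close to that of a dense adapter of one expert's size, which is consistent with the abstract's claim of comparable computational cost.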
Related papers
- MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts [12.42628977620548]
MoST (Mixture of Speech and Text) is a novel large language model that seamlessly integrates speech and text processing. We introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MoST consistently outperforms existing models of comparable parameter counts.
arXiv Detail & Related papers (2026-01-15T10:43:29Z) - UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE [48.211103577288675]
UniMoE-Audio is a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. To tackle data imbalance, we introduce a three-stage training curriculum. UniMoE-Audio achieves state-of-the-art performance on major speech and music generation benchmarks.
arXiv Detail & Related papers (2025-10-15T09:30:25Z) - BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression [74.0893986012049]
UniMMAD is a unified framework for multi-modal and multi-class anomaly detection. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes.
arXiv Detail & Related papers (2025-09-30T08:29:12Z) - High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z) - HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling [52.537908557508324]
HarmoniFuse is a component-selective and prompt-adaptive framework for multi-task speech language modeling. A batch-interleaved training strategy enables leveraging separate ASR and SER datasets without requiring joint annotation.
arXiv Detail & Related papers (2025-09-23T02:53:38Z) - SPANER: Shared Prompt Aligner for Multimodal Semantic Representation [0.0]
Shared Prompt AligNER (SPANER) is a modality-agnostic PEFT framework designed to embed inputs from diverse modalities into a unified semantic space. SPANER employs a shared prompt mechanism that acts as a conceptual anchor, enabling semantically related instances to converge spatially regardless of modality. Our results highlight the importance of aligning embedding structures, rather than merely tuning adapter weights, for scalable multimodal learning.
arXiv Detail & Related papers (2025-08-18T22:20:42Z) - Text-Queried Audio Source Separation via Hierarchical Modeling [53.94434504259829]
We propose a hierarchical decomposition framework, HSM-TSS, that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction. A Q-Audio architecture is employed to align audio and text modalities, serving as pretrained global-semantic encoders. Our method achieves state-of-the-art separation performance with data-efficient training while maintaining superior semantic consistency with queries in complex auditory scenes.
arXiv Detail & Related papers (2025-05-27T11:00:38Z) - Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid. We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands. It unifies various dense prediction tasks and diverse semantic class predictions.
arXiv Detail & Related papers (2025-05-18T07:39:17Z) - SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z) - WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks covering tasks such as ASR, ST, SV, and ER, and also apply it to specialized datasets such as the Gaokao English listening comprehension set for SQA and a speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z) - Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks [4.132793413136553]
We introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism.
The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention.
arXiv Detail & Related papers (2023-09-14T14:51:51Z) - Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.