SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
- URL: http://arxiv.org/abs/2601.09385v1
- Date: Wed, 14 Jan 2026 11:25:36 GMT
- Title: SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
- Authors: Ziyang Ma, Guanrou Yang, Wenxi Chen, Zhifu Gao, Yexing Du, Xiquan Li, Zhisheng Zheng, Haina Zhu, Jianheng Zhuo, Zheshu Song, Ruiyang Xu, Tianrui Wang, Yifan Yang, Yanqiao Zhu, Zhikang Niu, Liumeng Xue, Yinghao Ma, Ruibin Yuan, Shiliang Zhang, Kai Yu, Eng Siong Chng, Xie Chen
- Abstract summary: SLAM-LLM is an open-source framework designed to train customized Multimodal Large Language Models (MLLMs). It provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. It also includes high-performance checkpoints for tasks such as Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC).
- Score: 77.87631792556942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for speech, audio, and music. This situation hinders the development of audio-language models and forces researchers to spend substantial effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focusing on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some of the underlying techniques have been accepted for publication in academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
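To make the modular design the abstract describes more concrete, here is a minimal PyTorch sketch of how an encoder, projector, and LLM might be composed; the class names, shapes, and freezing choices are illustrative assumptions, not SLAM-LLM's actual API.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Maps encoder features into the LLM embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

class AudioLLM(nn.Module):
    """Hypothetical composition: frozen audio encoder -> projector -> LLM."""
    def __init__(self, encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.projector = projector
        self.llm = llm
        for p in self.encoder.parameters():  # encoders are often kept frozen
            p.requires_grad = False

    def forward(self, audio: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        audio_feats = self.encoder(audio)           # (B, T, enc_dim)
        audio_embeds = self.projector(audio_feats)  # (B, T, llm_dim)
        # Prepend projected audio tokens to the text embedding sequence.
        return self.llm(torch.cat([audio_embeds, text_embeds], dim=1))
```

Swapping encoders, projectors, or PEFT plugins then amounts to passing different modules into this composition, which is the kind of modularity the abstract emphasizes.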
Related papers
- FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation [3.8125534288516683]
FastSLM is a lightweight yet efficient speech-language model (SLM) designed for effective understanding and reasoning over long-form speech. We present a novel three-stage training strategy that enhances generalization across a wide range of speech-related tasks. Experimental results demonstrate that FastSLM achieves competitive performance compared to existing state-of-the-art models.
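The title centers on a frame-level Q-Former; as a point of reference, below is a minimal sketch of the generic Q-Former idea (learnable queries cross-attending to frame features to compress a long sequence). All names and sizes are illustrative assumptions, not FastSLM's actual design.

```python
import torch
import torch.nn as nn

class FrameQFormer(nn.Module):
    """Generic Q-Former sketch: a fixed set of learnable queries
    cross-attends to speech frame features, compressing a long frame
    sequence into a short, LLM-friendly one."""
    def __init__(self, num_queries: int = 32, dim: int = 768, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, dim), where T can be very large for long-form speech
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        out, _ = self.cross_attn(q, frames, frames)  # (B, num_queries, dim)
        return out + self.ffn(out)
```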
arXiv Detail & Related papers (2026-01-08T07:46:03Z)
- PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMs [29.049167884343998]
Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications. We propose an efficient alternative, Lightweight Audio LLM Integration (LAL). LAL encodes rich audio semantics at an appropriate level of abstraction for integration into different blocks of LLMs.
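The summary's phrase "integration into different blocks of LLMs" suggests injecting audio features at multiple transformer layers rather than only at the input. Below is one generic way to sketch that pattern; the tap layers, pooling, and additive injection are assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn as nn

class BlockwiseAudioInjector(nn.Module):
    """Generic sketch of feeding audio into several LLM blocks: each
    tapped block gets its own projection of a pooled audio vector,
    added to that block's hidden states."""
    def __init__(self, audio_dim: int, llm_dim: int, tap_layers: list[int]):
        super().__init__()
        self.tap_layers = set(tap_layers)
        self.projs = nn.ModuleDict(
            {str(i): nn.Linear(audio_dim, llm_dim) for i in tap_layers})

    def maybe_inject(self, layer_idx: int, hidden: torch.Tensor,
                     audio_vec: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, llm_dim); audio_vec: (B, audio_dim), e.g. mean-pooled
        if layer_idx in self.tap_layers:
            hidden = hidden + self.projs[str(layer_idx)](audio_vec).unsqueeze(1)
        return hidden
```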
arXiv Detail & Related papers (2025-06-12T07:23:07Z)
- VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation [26.34810950257782]
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs.
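For readers unfamiliar with multi-token prediction, the sketch below shows the generic idea of predicting several future tokens per step with parallel output heads; the head count and dimensions are illustrative assumptions, not VocalNet's actual configuration.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Generic MTP sketch: k parallel output heads, where head j
    predicts the token at offset t + 1 + j from the same hidden state,
    cutting the number of decoding steps roughly by a factor of k."""
    def __init__(self, hidden_dim: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(k))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, hidden_dim) -> logits: (B, T, k, vocab_size)
        return torch.stack([head(hidden) for head in self.heads], dim=2)
```

Fewer decoding steps per generated token is what enables the low latency the summary highlights.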
arXiv Detail & Related papers (2025-04-05T04:57:12Z)
- LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM [35.443850239910866]
We propose a lightweight, autoregressive streaming TTS system that generates high-quality speech with low latency. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score.
arXiv Detail & Related papers (2025-03-06T18:59:38Z)
- Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
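One common way to combine two speech encoders, as the MT-LLM summary suggests with WavLM and Whisper, is to concatenate their frame features and project the result into the LLM's embedding space; the sketch below illustrates that generic pattern with assumed dimensions, not the paper's exact method.

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Generic sketch: fuse features from two speech encoders (e.g., one
    speaker-sensitive, one semantics-oriented) by channel-wise
    concatenation followed by a linear projection."""
    def __init__(self, dim_a: int, dim_b: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(dim_a + dim_b, llm_dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (B, T, dim_a), feats_b: (B, T, dim_b); assumes the two
        # encoders have been aligned to the same frame rate T beforehand.
        return self.proj(torch.cat([feats_a, feats_b], dim=-1))
```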
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
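As a generic illustration of a "dense" connector, the sketch below projects concatenated hidden states from several encoder layers, rather than only the final layer, into the LLM space; the layer choice and MLP shape are assumptions, not the paper's specific variants.

```python
import torch
import torch.nn as nn

class DenseConnectorSketch(nn.Module):
    """Generic dense vision-language connector sketch: concatenate
    hidden states from several encoder layers and project them into
    the LLM embedding space via a small MLP."""
    def __init__(self, vis_dim: int, llm_dim: int, num_tapped_layers: int = 3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim * num_tapped_layers, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim))

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: list of (B, N, vis_dim) hidden states from chosen layers
        return self.proj(torch.cat(layer_feats, dim=-1))  # (B, N, llm_dim)
```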
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- OneLLM: One Framework to Align All Modalities with Language [86.8818857465443]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering, and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
- InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
- Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE [83.00018517368973]
Large Language Models (LLMs) can extend their zero-shot capabilities to multimodal learning through instruction tuning.
However, negative conflicts and interference between tasks may degrade performance.
We combine the well-known Mixture-of-Experts (MoE) technique with LoRA, a representative PEFT method, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning.
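The LoRA-MoE idea, routing among several low-rank adapters with a learned gate, can be sketched generically as below; the rank, expert count, and routing details are illustrative assumptions rather than Octavius's exact design.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """One low-rank adapter: delta-W = up(down(x)) with small rank r."""
    def __init__(self, dim_in: int, dim_out: int, r: int = 8):
        super().__init__()
        self.down = nn.Linear(dim_in, r, bias=False)
        self.up = nn.Linear(r, dim_out, bias=False)
        nn.init.zeros_(self.up.weight)  # standard LoRA init: start as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class LoRAMoE(nn.Module):
    """Generic LoRA-MoE sketch: a frozen base linear layer plus a
    token-wise gate that mixes the outputs of several LoRA experts."""
    def __init__(self, base: nn.Linear, num_experts: int = 4, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapters and gate are trained
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, r)
            for _ in range(num_experts))
        self.gate = nn.Linear(base.in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, in_features)
        weights = torch.softmax(self.gate(x), dim=-1)               # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], -1)  # (B, T, out, E)
        return self.base(x) + (expert_out * weights.unsqueeze(-2)).sum(-1)
```

Letting different experts specialize on different tasks is what allows this design to mitigate the task interference the summary describes.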
arXiv Detail & Related papers (2023-11-05T15:48:29Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
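As a structural illustration of the three components the Macaw-LLM summary names, here is a minimal skeleton; the module interfaces are assumptions for illustration, not the paper's actual code.

```python
import torch
import torch.nn as nn

class MacawStyleMLLM(nn.Module):
    """Skeleton mirroring the three named components: a modality module
    (per-modality encoders), an alignment module (projection into the
    LLM space), and a cognitive module (the pretrained LLM)."""
    def __init__(self, encoders: nn.ModuleDict, align: nn.ModuleDict, llm: nn.Module):
        super().__init__()
        self.encoders = encoders  # e.g., {"image": ..., "audio": ..., "video": ...}
        self.align = align        # one projector per modality
        self.llm = llm

    def forward(self, inputs: dict[str, torch.Tensor],
                text_embeds: torch.Tensor) -> torch.Tensor:
        aligned = [self.align[m](self.encoders[m](x)) for m, x in inputs.items()]
        # Concatenate aligned modality tokens ahead of the text embeddings.
        return self.llm(torch.cat(aligned + [text_embeds], dim=1))
```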
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.