SICL-AT: Another way to adapt Auditory LLM to low-resource task
- URL: http://arxiv.org/abs/2601.18904v1
- Date: Mon, 26 Jan 2026 19:15:16 GMT
- Title: SICL-AT: Another way to adapt Auditory LLM to low-resource task
- Authors: Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson,
- Abstract summary: Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. They often struggle when applied to low-resource or unfamiliar tasks. In-Context Learning (ICL) provides a training-free, inference-time solution.
- Score: 34.82834349882226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource or unfamiliar tasks. When labeled in-domain data is scarce or mismatched to the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that \emph{Vanilla ICL} improves zero-shot performance across diverse speech and audio tasks for selected models, which suggests that this ICL adaptation capability generalizes to the multimodal setting. Building on this, we propose \textbf{Speech In-Context Learning Adaptation Training (SICL-AT)}, a post-training recipe that uses only high-resource speech data to strengthen the model's in-context learning capability. The enhancement generalizes to audio understanding and reasoning tasks. Experiments indicate that our proposed method consistently outperforms direct fine-tuning in low-resource scenarios.
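For concreteness, the listing below is a minimal sketch of the vanilla ICL setup described in the abstract: the auditory LLM is conditioned on a few labeled in-domain (audio, answer) demonstrations before the unlabeled test clip, with no parameter updates. The chat-message schema and the commented-out `auditory_llm.generate` call are illustrative assumptions, not the paper's actual interface; SICL-AT would additionally post-train the model on episodes of this form built from high-resource speech data.

```python
# Minimal sketch of vanilla in-context learning (ICL) for an auditory LLM:
# the model is conditioned on a few labeled in-domain (audio, answer)
# demonstrations followed by the unlabeled query clip, with no weight updates.
# The message schema and the model call are illustrative assumptions only.
from typing import List, Tuple


def build_icl_prompt(
    demos: List[Tuple[str, str]],  # (audio_path, answer) pairs from the target task
    query_audio: str,              # path of the clip to be labeled at test time
    instruction: str,              # natural-language task description
) -> List[dict]:
    """Interleave the instruction, demonstrations, and query into one prompt."""
    messages = [{"role": "system", "content": instruction}]
    for audio_path, answer in demos:
        # Each demonstration is an (audio, answer) turn the model can imitate.
        messages.append({"role": "user", "content": [{"type": "audio", "path": audio_path}]})
        messages.append({"role": "assistant", "content": answer})
    # The unlabeled query clip comes last; the model completes the answer.
    messages.append({"role": "user", "content": [{"type": "audio", "path": query_audio}]})
    return messages


if __name__ == "__main__":
    demos = [("demo_angry.wav", "angry"), ("demo_neutral.wav", "neutral")]
    prompt = build_icl_prompt(demos, "test_clip.wav", "Classify the speaker's emotion.")
    # prediction = auditory_llm.generate(prompt)  # hypothetical model interface
    for turn in prompt:
        print(turn)
```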
Related papers
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models [34.15708407614003]
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities. We present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-10T16:03:44Z)
- An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM [15.340075567628466]
This work examined the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning improves the results further.
arXiv Detail & Related papers (2025-11-04T03:54:55Z)
- Context Tuning for In-Context Optimization [12.054433776717309]
Context Tuning is a simple and effective method to enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. In contrast to prompt-based adaptation techniques, Context Tuning initializes a trainable prompt or prefix with task-specific demonstration examples. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods.
arXiv Detail & Related papers (2025-07-06T03:23:53Z)
- Surprise Calibration for Better In-Context Learning [6.566285172635043]
In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models. Existing bias calibration methods apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings. We introduce a novel method, Surprise Calibration (SC), which captures the temporal dynamics of class priors.
arXiv Detail & Related papers (2025-06-15T10:04:42Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs [29.853196429972204]
LiSTEN is a framework for adapting large language models to audio-language tasks. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process.
arXiv Detail & Related papers (2025-05-24T05:28:22Z)
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the SotA when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)