Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
- URL: http://arxiv.org/abs/2601.20867v1
- Date: Tue, 06 Jan 2026 12:47:32 GMT
- Title: Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
- Authors: Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim
- Abstract summary: We propose Semantically Expanded Prompt Tuning (SEPT) for prompt tuning in audio-language models (ALMs). SEPT regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines.
- Score: 32.60365302637783
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and is recently being adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines, while maintaining computational cost during inference. Code is available at https://github.com/jhyukjang/SEPT.
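The margin-constrained semantic expansion loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tensor shapes, the cosine-similarity formulation, the single shared margin, and the equal weighting of the two terms are all assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def semantic_expansion_loss(class_emb, neighbor_emb, margin=0.2):
    """Hypothetical margin-based semantic expansion loss.

    class_emb:    (C, D) prompt-tuned text embeddings, one per class
    neighbor_emb: (C, K, D) embeddings of K LLM-generated semantic
                  neighbors for each class
    """
    class_emb = F.normalize(class_emb, dim=-1)
    neighbor_emb = F.normalize(neighbor_emb, dim=-1)

    # Cosine similarity between every class and every neighbor set: (C, C, K)
    sim = torch.einsum("cd,nkd->cnk", class_emb, neighbor_emb)

    C = class_emb.shape[0]
    eye = torch.eye(C, dtype=torch.bool, device=class_emb.device)

    # Intra-class compactness: pull each class toward its own neighbors.
    pos = sim[eye]             # (C, K)
    intra = (1.0 - pos).mean()

    # Inter-class separability: push similarity to other classes'
    # neighbors below the margin.
    neg = sim[~eye]            # (C*(C-1), K)
    inter = F.relu(neg - margin).mean()

    return intra + inter
```

In this reading, the first term encourages each class's prompt embedding to sit close to its LLM-generated paraphrases, while the hinge term only penalizes cross-class similarities that exceed the margin, leaving already-separated classes untouched.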
Related papers
- SARM: LLM-Augmented Semantic Anchor for End-to-End Live-Streaming Ranking [49.109782956280064]
Large-scale live-streaming recommendation requires precise modeling of non-stationary content semantics under real-time serving constraints. We propose SARM, an end-to-end ranking architecture that integrates natural-language semantic anchors directly into ranking optimization. SARM is fully deployed and serves over 400 million users daily.
arXiv Detail & Related papers (2026-02-10T04:15:53Z) - MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free [27.346096262060787]
We introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks.
arXiv Detail & Related papers (2026-01-06T12:24:38Z) - STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions [4.169671705130711]
We propose STELLA, a framework that mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting.
arXiv Detail & Related papers (2025-12-04T14:56:36Z) - GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models [34.002791706686345]
Visual and textual soft prompt tuning can improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. Existing methods attempt to mitigate this effect by regularizing the gap between hand-crafted prompts and soft prompts. We propose a plug-and-play coupling prompt learning framework to optimize the performance of V-L models in video tasks.
arXiv Detail & Related papers (2025-11-27T05:36:47Z) - What Makes You Unique? Attribute Prompt Composition for Object Re-Identification [70.67907354506278]
Object Re-IDentification aims to recognize individuals across non-overlapping camera views. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies. We propose an Attribute Prompt Composition framework, which exploits textual semantics to jointly enhance discrimination and generalization.
arXiv Detail & Related papers (2025-09-23T07:03:08Z) - Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network [17.91342898415867]
Existing ATFL methods rely on training efficient networks using fine-grained annotations. We propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision manners to promote localization performance. The proposed LOCO achieves SOTA performance on three public benchmarks.
arXiv Detail & Related papers (2025-05-03T17:57:57Z) - Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation [5.296260279593993]
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. We propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment.
arXiv Detail & Related papers (2025-03-11T21:38:34Z) - Open-Vocabulary Segmentation with Semantic-Assisted Calibration [68.41025728960176]
We study open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with contextual prior of CLIP. We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z) - Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z) - Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation under such non-stationary environment.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv Detail & Related papers (2022-11-30T16:26:38Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z) - Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to augment current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.