SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training
- URL: http://arxiv.org/abs/2601.12594v1
- Date: Sun, 18 Jan 2026 21:36:19 GMT
- Title: SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training
- Authors: Xinhao Mei, Gael Le Lan, Haohe Liu, Zhaoheng Ni, Varun Nagaraja, Yang Liu, Yangyang Shi, Vikas Chandra
- Abstract summary: We introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations.
SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single training stage, facilitating the learning of richer dense audio representations.
- Score: 31.192251626550203
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short, fixed-duration inputs, which constrains their usage in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single training stage, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
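To make the multi-objective training concrete, below is a minimal, illustrative PyTorch sketch of combining a CLIP-style contrastive loss with a self-supervised loss and a captioning loss in one training step. The specific SSL term (masked reconstruction), the next-token captioning term, the unit loss weights, and all tensor names are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over global audio/text embeddings (the standard CLAP objective)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both the audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def single_stage_loss(audio_emb, text_emb, recon_pred, recon_target,
                      caption_logits, caption_ids, w_ssl=1.0, w_cap=1.0):
    """Sum contrastive, self-supervised, and captioning losses in one training step.

    The reconstruction-style SSL term and next-token captioning term are
    illustrative stand-ins for whatever objectives the paper actually uses.
    """
    loss_con = contrastive_loss(audio_emb, text_emb)       # global alignment
    loss_ssl = F.mse_loss(recon_pred, recon_target)        # dense, frame-level supervision
    loss_cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_ids.flatten())
    return loss_con + w_ssl * loss_ssl + w_cap * loss_cap

# Toy tensors standing in for model outputs on a batch of 4 audio-text pairs.
B, D, T, V, L = 4, 32, 10, 50, 6   # batch, embed dim, frames, vocab, caption length
loss = single_stage_loss(
    audio_emb=torch.randn(B, D), text_emb=torch.randn(B, D),
    recon_pred=torch.randn(B, T, D), recon_target=torch.randn(B, T, D),
    caption_logits=torch.randn(B, L, V), caption_ids=torch.randint(0, V, (B, L)),
)
print(loss.item())
```

In this reading, the contrastive term supervises only global representations, while the reconstruction and captioning terms put pressure on dense, frame-level features, which matches the motivation the abstract gives for the multi-objective design.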
Related papers
- Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation [30.42124709340273]
We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation.
Our results demonstrate that audio-language pretraining yields competitive, transferable representations.
These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations.
arXiv Detail & Related papers (2025-11-20T19:17:35Z)
- USAD: Universal Speech and Audio Representation via Distillation [56.91647396619358]
Universal Speech and Audio Distillation (USAD) is a unified approach to audio representation learning.
USAD integrates diverse audio types - speech, sound, and music - into a single model.
arXiv Detail & Related papers (2025-06-23T17:02:00Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.
These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.
We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction [9.101978573666546]
Baichuan-Audio is an end-to-end audio large language model that seamlessly integrates audio understanding and generation.
It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities.
arXiv Detail & Related papers (2025-02-24T15:16:34Z)
- Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest.
The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research.
We propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, via a temporal instillation method, TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
- T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining [38.604112878493396]
Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language.
We introduce T-CLAP, a temporal-enhanced CLAP model, to capture temporal information within audio and text features.
T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
arXiv Detail & Related papers (2024-04-27T07:05:48Z)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called the "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
The success of self-supervised learning (SSL) has been witnessed in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representations from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask-and-label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model; a toy sketch of this loop appears after this list.
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new self-supervised learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
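As referenced in the BEATs entry above, its summary describes a concrete iterative loop: a random-projection acoustic tokenizer, mask-and-label SSL training, then tokenizer distillation. Below is a toy, runnable Python sketch of the first iteration under assumed shapes and names; the tokenizer and training loop are simplified stand-ins, not the authors' implementation, and the distillation step is noted only in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomProjectionTokenizer(nn.Module):
    """Assigns each feature frame a discrete label via a fixed random codebook,
    the role random projection plays in BEATs' first iteration (toy version)."""
    def __init__(self, feat_dim=64, codebook_size=256):
        super().__init__()
        self.register_buffer("codebook", torch.randn(codebook_size, feat_dim))

    @torch.no_grad()
    def forward(self, feats):                    # feats: (B, T, D)
        return (feats @ self.codebook.t()).argmax(dim=-1)   # labels: (B, T)

def train_ssl_model(feats, tokenizer, steps=100, mask_ratio=0.5):
    """Mask-and-label prediction: predict the tokenizer's labels at masked frames."""
    B, T, D = feats.shape
    num_labels = tokenizer.codebook.size(0)
    model = nn.Sequential(nn.Linear(D, 128), nn.GELU(), nn.Linear(128, num_labels))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    labels = tokenizer(feats)
    for _ in range(steps):
        mask = torch.rand(B, T) < mask_ratio                 # which frames to mask
        masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked frames
        loss = F.cross_entropy(model(masked)[mask], labels[mask])
        opt.zero_grad(); loss.backward(); opt.step()
    return model

feats = torch.randn(8, 50, 64)                   # toy batch of frame-level features
ssl_model = train_ssl_model(feats, RandomProjectionTokenizer())
# Next iteration (omitted here): distill a new acoustic tokenizer from ssl_model's
# representations and rerun train_ssl_model with it, as the BEATs summary describes.
```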