WavLink: Compact Audio-Text Embeddings with a Global Whisper Token
- URL: http://arxiv.org/abs/2601.15118v2
- Date: Thu, 22 Jan 2026 08:55:20 GMT
- Title: WavLink: Compact Audio-Text Embeddings with a Global Whisper Token
- Authors: Gokul Karthik Kumar, Ludovick Lepauloux, Hakim Hacid
- Abstract summary: We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop.
- Score: 4.000493292896401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models, such as those based on CLAP, have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST) and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.
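The abstract combines three ingredients: a learnable global token that pools Whisper's 1500 frame features into one vector, joint contrastive training against a text encoder, and Matryoshka-style supervision over nested embedding prefixes. Below is a minimal PyTorch sketch of that combination; the module and loss names, the attention-based pooling, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): a learnable global
# token pools Whisper encoder frames into one clip embedding, trained with a
# symmetric contrastive loss at nested (Matryoshka) prefix dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalTokenPool(nn.Module):
    """Learnable global token that attends over encoder frame features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.global_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, dim), e.g. T = 1500 Whisper frames for a 30 s clip
        query = self.global_token.expand(frames.size(0), -1, -1)
        pooled, _ = self.attn(query, frames, frames)  # token attends to frames
        return self.norm(pooled.squeeze(1))           # (B, dim)

def matryoshka_clip_loss(audio, text, dims=(64, 128, 256, 512), temp=0.07):
    """Symmetric InfoNCE applied at each nested embedding prefix."""
    loss = 0.0
    for d in dims:
        a = F.normalize(audio[:, :d], dim=-1)
        t = F.normalize(text[:, :d], dim=-1)
        logits = a @ t.T / temp
        labels = torch.arange(a.size(0))
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.T, labels))
    return loss / len(dims)

# Toy usage with random tensors standing in for Whisper / text encoder output.
pool = GlobalTokenPool(dim=512)
audio_emb = pool(torch.randn(4, 1500, 512))
text_emb = torch.randn(4, 512)
print(matryoshka_clip_loss(audio_emb, text_emb))
```

Serving a truncated prefix (e.g. 64 of 512 dimensions) rather than the full vector is what would yield the 8x smaller embeddings the abstract mentions.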
Related papers
- Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens [62.56027815951259]
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale.
arXiv Detail & Related papers (2026-02-18T18:32:46Z) - MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models [2.3310964423816896]
- MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models [2.3310964423816896]
Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets. We propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model.
arXiv Detail & Related papers (2025-09-16T02:36:00Z) - UniVerse-1: Unified Audio-Video Generation via Stitching of Experts [59.38012380516272]
- UniVerse-1: Unified Audio-Video Generation via Stitching of Experts [59.38012380516272]
We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique.
arXiv Detail & Related papers (2025-09-07T17:55:03Z) - Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference [10.909997817643905]
We present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression at 1.89 kbps and 21.5 frames per second.
We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
arXiv Detail & Related papers (2024-09-18T16:39:10Z) - WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain. WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z) - Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding [30.46616330202622]
- Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding [30.46616330202622]
Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language.
Recent advancements in large language models (LLMs) have opened up possibilities for improving AAC.
Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
arXiv Detail & Related papers (2024-06-19T07:09:46Z) - Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore such large pre-trained models to obtain features, i.e., CLIP for visual features and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
arXiv Detail & Related papers (2024-04-09T13:39:37Z) - Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
- Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content.
We propose a diffusion model based system for Foley sound generation with text conditions.
arXiv Detail & Related papers (2023-06-17T14:16:24Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Automatic Audio Captioning using Attention weighted Event based
Embeddings [25.258177951665594]
We propose an encoder-decoder architecture with lightweight (i.e., fewer learnable parameters) Bi-LSTM recurrent layers for AAC.
Our results show that an efficient AED-based embedding extractor combined with temporal attention and augmentation techniques is able to surpass the existing literature.
arXiv Detail & Related papers (2022-01-28T05:54:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.