Related papers: Leveraging Whisper Embeddings for Audio-based Lyrics Matching

URL: http://arxiv.org/abs/2510.08176v1
Date: Thu, 09 Oct 2025 13:03:34 GMT
Title: Leveraging Whisper Embeddings for Audio-based Lyrics Matching
Authors: Eleonora Mancini, Joan Serrà, Paolo Torroni, Yuki Mitsufuji
Abstract summary: WEALY is a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. We demonstrate that WEALY achieves performance comparable to state-of-the-art methods that lack reproducibility. This work contributes a reliable benchmark for future research and underscores the potential of speech technologies for music information retrieval tasks.
Score: 35.54408523154097
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.

Related papers

Summarizing Speech: A Comprehensive Survey [76.13011304983458]
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. This survey examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches.
arXiv Detail & Related papers (2025-04-10T17:50:53Z)
CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining [15.58671300364536]
This paper presents a novel cross-modal contrastive learning framework to guide music similarity modeling. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach. Experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks.
arXiv Detail & Related papers (2025-03-29T15:43:09Z)
Re-calibrating methodologies in social media research: Challenge the visual, work with Speech [0.0]
This article reflects on how social media scholars can effectively engage with speech-based data in their analyses. I conclude that the expansion of our methodological repertoire enables richer interpretations of platformised content.
arXiv Detail & Related papers (2024-12-17T18:47:57Z)
Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs. We propose a more realistic setting in which only noisy text and its NER labels are available. We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
Synthetic Lyrics Detection Across Languages and Genres [4.987546582439803]
The use of large language models (LLMs) to generate music content, particularly lyrics, has gained in popularity. Previous research has explored content detection in various domains, but no work has focused on the text modality, lyrics, in music. We curated a diverse dataset of real and synthetic lyrics from multiple languages, music genres, and artists. We performed a thorough evaluation of existing synthetic text detection approaches on lyrics, a previously unexplored data type. Following both music and industrial constraints, we examined how well these approaches generalize across languages, scale with data availability, handle multilingual language content, and perform on novel genres in few-shot settings.
arXiv Detail & Related papers (2024-06-21T15:19:21Z)
Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content. We employ the snippet embeddings in the higher-level task of cross-modal piece identification. In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
arXiv Detail & Related papers (2023-09-21T14:54:48Z)
ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models [67.66825818489406]
This paper introduces a text-to-waveform music generation model, underpinned by the utilization of diffusion models. Our methodology hinges on the innovative incorporation of free-form textual prompts as conditional factors to guide the waveform generation process. We demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
arXiv Detail & Related papers (2023-02-09T06:27:09Z)
Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions. We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation. Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method. With our proposed system, a significant improvement has been achieved on bidirectional audio-text retrieval, on all metrics including recall, median and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z)
TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining. We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time. Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.