FlexSED: Towards Open-Vocabulary Sound Event Detection
- URL: http://arxiv.org/abs/2509.18606v1
- Date: Tue, 23 Sep 2025 03:52:52 GMT
- Title: FlexSED: Towards Open-Vocabulary Sound Event Detection
- Authors: Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali
- Abstract summary: We propose FlexSED, an open-vocabulary sound event detection system. It builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy. It achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities.
- Score: 18.4414685940782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.
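To make the querying pattern concrete, here is a minimal PyTorch sketch of how an open-vocabulary detector of this kind can consume a free-text query and emit frame-level activity. Everything in it is a hypothetical stand-in: the tiny random-weight modules replace the pretrained audio SSL model and the CLAP text encoder, and the FiLM-style conditioning is only one plausible reading of "adaptive fusion", not FlexSED's actual mechanism.

```python
# Minimal sketch of open-vocabulary SED inference. All names are
# hypothetical placeholders, not the FlexSED API; the toy encoders stand
# in for the pretrained audio SSL model and the CLAP text encoder.
import torch
import torch.nn as nn

class ToyOpenVocabSED(nn.Module):
    def __init__(self, audio_dim=64, text_dim=32, hidden=128):
        super().__init__()
        # Stand-in for a pretrained audio SSL encoder (frame-level features).
        self.audio_encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        # Stand-in for a frozen CLAP-style text encoder's output projection.
        self.text_proj = nn.Linear(text_dim, hidden)
        # "Adaptive fusion" rendered as FiLM-style conditioning: the text
        # query scales and shifts every audio frame (an assumption, not
        # necessarily the paper's exact fusion mechanism).
        self.film = nn.Linear(hidden, 2 * hidden)
        # Decoder head producing one presence logit per frame.
        self.head = nn.Linear(hidden, 1)

    def forward(self, audio_feats, text_emb):
        # audio_feats: (batch, frames, audio_dim); text_emb: (batch, text_dim)
        frames, _ = self.audio_encoder(audio_feats)
        scale, shift = self.film(self.text_proj(text_emb)).chunk(2, dim=-1)
        fused = frames * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        # Per-frame probability that the queried event is active.
        return torch.sigmoid(self.head(fused)).squeeze(-1)

model = ToyOpenVocabSED()
audio = torch.randn(1, 100, 64)   # 100 frames of dummy audio features
query = torch.randn(1, 32)        # dummy embedding for, e.g., "dog barking"
print(model(audio, query).shape)  # torch.Size([1, 100]) frame activities
```

Because the query is an embedding rather than a class index, swapping in a new sound class at inference time is just a matter of encoding a new text prompt, which is what gives this family of systems its zero-shot flavor.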
Related papers
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
- SICL-AT: Another way to adapt Auditory LLM to low-resource task [34.82834349882226]
Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks, yet they often struggle when applied to low-resource or unfamiliar tasks. In-Context Learning (ICL) provides a training-free, inference-time solution.
arXiv Detail & Related papers (2026-01-26T19:15:16Z)
- SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training [31.192251626550203]
We introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations. SLAP unifies a contrastive loss with additional self-supervised and captioning losses in a single training stage, facilitating the learning of richer dense audio representations.
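As a rough illustration of the multi-objective idea, the sketch below combines a symmetric CLIP-style contrastive loss with a captioning cross-entropy in a single training step. The function names and loss weights are invented for illustration, the additional self-supervised terms are omitted, and none of this is SLAP's actual code.

```python
# Illustrative single-stage multi-objective step: contrastive + captioning.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired audio/text embeddings.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(audio_emb, text_emb, caption_logits, caption_tokens,
                  w_contrastive=1.0, w_caption=1.0):
    # caption_logits: (batch, seq, vocab); caption_tokens: (batch, seq)
    l_con = contrastive_loss(audio_emb, text_emb)
    l_cap = F.cross_entropy(caption_logits.flatten(0, 1),
                            caption_tokens.flatten())
    return w_contrastive * l_con + w_caption * l_cap

# Dummy batch: 8 pairs, 256-d embeddings, 12-token captions, 1000-word vocab.
loss = training_step(torch.randn(8, 256), torch.randn(8, 256),
                     torch.randn(8, 12, 1000), torch.randint(0, 1000, (8, 12)))
print(loss.item())
```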
arXiv Detail & Related papers (2026-01-18T21:36:19Z)
- Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries [23.83866791274789]
We propose a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting.
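The frame-level retrieval formulation needs very little code; here is a hedged sketch in which per-frame audio embeddings are scored against one query vector by cosine similarity and thresholded. The shapes and threshold are illustrative, not DASM's actual configuration.

```python
# SED as frame-level retrieval: score every frame against a query vector.
import torch
import torch.nn.functional as F

def detect(frame_embs, query_vec, threshold=0.5):
    # frame_embs: (frames, dim) per-frame audio embeddings
    # query_vec: (dim,) embedding of a text or audio prompt
    scores = F.cosine_similarity(frame_embs, query_vec.unsqueeze(0), dim=-1)
    return scores, scores > threshold  # per-frame score and binary activity

frame_embs = torch.randn(100, 256)  # 100 frames of dummy embeddings
query_vec = torch.randn(256)        # dummy embedding for "glass breaking"
scores, active = detect(frame_embs, query_vec)
print(int(active.sum()), "frames flagged as active")
```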
arXiv Detail & Related papers (2025-07-22T08:24:01Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired. We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z)
- Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction [37.27069171640074]
Humans can easily isolate a single speaker from a complex acoustic environment, a capability referred to as the "Cocktail Party Effect".
Traditional target speaker extraction approaches rely on voiceprints, which raise privacy concerns and face issues related to the quality and availability of enrollment samples.
This work introduces a novel text-guided TSE paradigm named LLM-TSE.
arXiv Detail & Related papers (2023-10-11T08:17:54Z)
- Leveraging Language Model Capabilities for Sound Event Detection [10.792576135806623]
We propose an end-to-end framework for understanding audio features while simultaneously generating sound events and their temporal locations.
Specifically, we employ pretrained acoustic models to capture discriminative features across different categories and language models for autoregressive text generation.
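To show what autoregressive generation of events and their timings can look like, here is a toy greedy decode loop over an invented token vocabulary (event labels plus quantized timestamps). The modules, vocabulary, and decoding scheme are placeholders, not the paper's actual setup.

```python
# Toy autoregressive SED decoder: a pooled acoustic embedding seeds a
# recurrent decoder that greedily emits event and timestamp tokens.
import torch
import torch.nn as nn

vocab = ["<pad>", "<eos>", "<dog_bark>", "<speech>", "<0.5s>", "<2.3s>"]

class ToySEDDecoder(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.embed = nn.Embedding(len(vocab), dim)
        self.rnn = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, len(vocab))

    @torch.no_grad()
    def generate(self, audio_summary, max_len=8):
        # audio_summary: (dim,) pooled clip embedding from an acoustic encoder
        h, tok, out = audio_summary, torch.tensor(0), []
        for _ in range(max_len):
            h = self.rnn(self.embed(tok).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
            tok = self.out(h).argmax()  # greedy next-token choice
            if vocab[tok] == "<eos>":
                break
            out.append(vocab[tok])
        return out

decoder = ToySEDDecoder()
print(decoder.generate(torch.randn(32)))  # e.g. ['<speech>', '<0.5s>', ...]
```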
arXiv Detail & Related papers (2023-08-22T15:59:06Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
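A minimal sketch of that training objective, under two loud assumptions: the model predicts clean boundaries directly (x0-prediction), and the diffusion runs over normalized (onset, offset) pairs rather than the paper's latent queries. Ground-truth boundaries are corrupted with scheduled Gaussian noise and a toy network learns to recover them.

```python
# One denoising-diffusion training step over event boundaries (a sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 100  # number of diffusion steps
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)

denoiser = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def train_step(boundaries):
    # boundaries: (batch, 2) ground-truth (onset, offset), normalized to [0, 1]
    t = torch.randint(0, T, (boundaries.size(0),))
    ab = alpha_bar[t].unsqueeze(-1)
    noisy = ab.sqrt() * boundaries + (1 - ab).sqrt() * torch.randn_like(boundaries)
    # Condition on the timestep and regress the clean boundaries directly,
    # i.e. learn to reverse the noising process.
    pred = denoiser(torch.cat([noisy, t.float().unsqueeze(-1) / T], dim=-1))
    loss = F.mse_loss(pred, boundaries)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(train_step(torch.rand(16, 2).sort(dim=-1).values))  # onsets <= offsets
```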
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.