Related papers: Massive Sound Embedding Benchmark (MSEB)

Massive Sound Embedding Benchmark (MSEB)

URL: http://arxiv.org/abs/2602.07143v1
Date: Fri, 06 Feb 2026 19:33:33 GMT
Title: Massive Sound Embedding Benchmark (MSEB)
Authors: Georg Heigold, Ehsan Variani, Tom Bagby, Cyril Allauzen, Ji Ma, Shankar Kumar, Michael Riley,
Abstract summary: We present the Massive Sound Embedding Benchmark (MSEB), a framework designed to evaluate the auditory components of any multimodal system.<n>MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets.<n>Our initial experiments establish clear performance headrooms, highlighting the significant opportunity to improve real-world multimodal experiences.
Score: 12.647736296545224
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding' - be it a single vector, a sequence of continuous or discrete representations, or another structured form - which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headrooms, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth. The library is publicly hosted at github.

Related papers

OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio.<n>Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z)
Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge.<n>We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
Harmonizing the Arabic Audio Space with Data Scheduling [15.84874997729878]
This paper presents the first systematic study of multi-task instruction tuning for an Arabic-centric audio LLM.<n>We fine-tune Qwen2.5- Omni (7B) and propose Task-Progressive Curriculum (TPC) along with Aligner-Based Diverse Sampling (ADS)<n>Our results reveal a critical efficiency, robustness trade-off: while ADS accelerates initial convergence, its inherent gradient volatility can destabilize generative decoding under prolonged training.
arXiv Detail & Related papers (2026-01-18T17:08:31Z)
Discrete Audio Tokens: More Than a Survey! [137.3721175670642]
This paper presents a systematic review and benchmark of discrete audio tokenizers.<n>It covers speech, music, and general audio domains.<n>We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains.
arXiv Detail & Related papers (2025-06-12T01:35:43Z)
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer.<n> AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding task.<n>We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystack
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks [112.6716697906318]
We present Dynamic-SUPERB Phase-2, an open benchmark for the comprehensive evaluation of instruction-based universal speech models.<n>Building upon the first generation, this second version incorporates 125 new tasks, expanding the benchmark to a total of 180 tasks.<n> Evaluation results show that no model performed well universally.
arXiv Detail & Related papers (2024-11-08T06:33:22Z)
AudioBench: A Universal Benchmark for Audio Large Language Models [41.46064884020139]
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs)<n>It encompasses 8 distinct tasks and 26 datasets, among which, 7 are newly proposed datasets.<n>The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic)
arXiv Detail & Related papers (2024-06-23T05:40:26Z)
Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) still remains challenging to extract the target speaker without prior information on multi-talker conditions. We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
High-resolution embedding extractor for speaker diarisation [15.392429990363492]
This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE) HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success. Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set.
arXiv Detail & Related papers (2022-11-08T07:41:18Z)
BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. We present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.