MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
- URL: http://arxiv.org/abs/2507.23511v2
- Date: Sat, 02 Aug 2025 02:46:50 GMT
- Title: MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
- Authors: Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan
- Abstract summary: MECAT is a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. It integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning. A comprehensive evaluation of state-of-the-art audio models is also presented.
- Score: 38.51162036564078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
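As a rough illustration of the DATE idea described above (combining single-sample semantic similarity with cross-sample discriminability), the following is a minimal toy sketch, not the paper's actual metric: it scores a candidate caption embedding by its similarity to its own reference minus its average similarity to other samples' references, so a generic caption that matches everything earns no margin. The embedding inputs and the margin formulation are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def date_style_score(cand_emb, ref_emb, other_ref_embs):
    """Toy DATE-style score (illustrative, not the paper's formula):
    reward similarity to the matching reference, penalize average
    similarity to other samples' references. Generic captions score
    high against every reference, so their margin collapses toward 0."""
    match = cosine(cand_emb, ref_emb)
    cross = np.mean([cosine(cand_emb, r) for r in other_ref_embs]) if other_ref_embs else 0.0
    return match - cross  # discriminative margin
```

With orthogonal toy embeddings, a caption aligned only with its own reference gets the full margin, while a "generic" caption equidistant from all references gets roughly zero, which is the behavior the DATE metric is designed to reward and penalize respectively.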
Related papers
- Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries [23.83866791274789]
We propose a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors from text or audio prompts. DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting.
arXiv Detail & Related papers (2025-07-22T08:24:01Z)
- Discrete Audio Tokens: More Than a Survey! [107.69720675124255]
This paper presents a systematic review and benchmark of discrete audio tokenizers. It covers speech, music, and general audio domains. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains.
arXiv Detail & Related papers (2025-06-12T01:35:43Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models [18.11667976818302]
IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions.
arXiv Detail & Related papers (2025-05-22T15:15:29Z)
- DAVE: Diagnostic benchmark for Audio Visual Evaluation [43.54781776394087]
We introduce DAVE, a novel benchmark dataset designed to systematically evaluate audio-visual models. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement.
arXiv Detail & Related papers (2025-03-12T12:12:46Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Evaluating the reliability of acoustic speech embeddings [10.5754802112615]
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences.
Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods.
We find that overall, ABX and MAP correlate with one another and with frequency estimation.
arXiv Detail & Related papers (2020-07-27T13:24:09Z)
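The embedding-reliability entry above compares ABX discrimination and Mean Average Precision (MAP). As a hedged aside, a minimal sketch of MAP over ranked retrieval lists follows; this is a toy version for intuition, not the exact evaluation protocol used in that paper.

```python
def mean_average_precision(rankings):
    """MAP over ranked lists of 0/1 relevance flags (best-first).

    For each query, average precision is the mean of precision@k
    taken at each rank k where a relevant item appears; MAP is the
    mean of those per-query averages."""
    average_precisions = []
    for relevance in rankings:
        hits, precisions = 0, []
        for rank, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                hits += 1
                precisions.append(hits / rank)  # precision at this rank
        # Queries with no relevant items contribute 0 by convention here.
        average_precisions.append(sum(precisions) / max(hits, 1))
    return sum(average_precisions) / len(average_precisions)
```

For a single query ranked [relevant, irrelevant, relevant], precision is 1/1 at rank 1 and 2/3 at rank 3, giving an average precision of 5/6; MAP then averages such values across queries (here, across embedding-comparison queries).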
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.