MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
- URL: http://arxiv.org/abs/2410.19168v1
- Date: Thu, 24 Oct 2024 21:20:10 GMT
- Title: MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
- Authors: S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
- Abstract summary: MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers.
It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks.
We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
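To make the reported numbers concrete, the following is a minimal sketch of how accuracy on MMAU-style multiple-choice audio QA could be computed. The record fields and the `model.answer` interface are hypothetical stand-ins for illustration, not MMAU's released data schema or evaluation code.

```python
# Minimal sketch: accuracy over MMAU-style multiple-choice audio QA.
# Record fields and `model.answer` are hypothetical, not MMAU's real schema.
from dataclasses import dataclass

@dataclass
class AudioQARecord:
    audio_path: str     # path to the audio clip (speech, sound, or music)
    question: str       # human-annotated natural language question
    choices: list[str]  # answer options
    answer: str         # gold answer string

def evaluate(model, records: list[AudioQARecord]) -> float:
    """Return overall accuracy; MMAU reports ~52.97% for Gemini Pro v1.5."""
    correct = 0
    for rec in records:
        # Assumed interface: the model picks one of the given choices.
        prediction = model.answer(rec.audio_path, rec.question, rec.choices)
        correct += int(prediction.strip().lower() == rec.answer.strip().lower())
    return correct / len(records)
```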
Related papers
- Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions [84.73122243726775]
Bagpiper is an 8B audio foundation model that interprets physical audio via rich captions.
During fine-tuning, Bagpiper adopts a caption-then-process workflow to solve diverse tasks without task-specific priors.
To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio.
arXiv Detail & Related papers (2026-02-05T02:20:07Z) - VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing [45.15289852736435]
VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories.
To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio.
Results reveal three key findings; among them, proprietary models do not universally outperform open-source models.
arXiv Detail & Related papers (2025-09-26T17:59:59Z) - MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark [64.89810922949984]
We introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks.
MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips.
We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks.
arXiv Detail & Related papers (2025-09-26T15:12:46Z) - SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information [44.99833362998488]
Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and more.
While their performance on speech- and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored.
We introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information.
Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly.
arXiv Detail & Related papers (2025-05-19T15:20:32Z) - Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.84031769492708]
This task defines three QA subsets to test audio-language models on interactive question-answering over diverse acoustic scenes.
Preliminary results on the development set are compared, showing strong variation across models and subsets.
This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity.
arXiv Detail & Related papers (2025-05-12T09:04:16Z) - ACVUBench: Audio-Centric Video Understanding Benchmark [35.77437191750556]
ACVUBench is an audio-centric video understanding benchmark.
It incorporates 2,662 videos spanning 18 different domains with rich auditory information.
It holistically tests the understanding of both audio content and audio-visual interactions in videos.
arXiv Detail & Related papers (2025-03-25T16:28:24Z) - Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [95.45204813682885]
We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks.
We train Audio-Reasoner on CoTA, enabling it to achieve strong logical reasoning capabilities in audio tasks.
Our findings underscore the central role of structured CoT training in advancing audio reasoning.
arXiv Detail & Related papers (2025-03-04T06:18:34Z) - Audiopedia: Audio QA with Knowledge [0.0]
We introduce Audiopedia, a novel task called Audio Question Answering with Knowledge.
Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions.
We benchmark large audio language models (LALMs) on these sub-tasks and observe suboptimal performance.
We propose a generic framework that can be adapted to any LALM, equipping them with knowledge reasoning capabilities.
arXiv Detail & Related papers (2024-12-29T23:48:35Z) - Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
Audio Question Answering has garnered attention due to the advent of Large Audio Language Models.
While LALMs excel in general audio understanding, they are limited in temporal reasoning.
This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z) - GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities [43.23351906406144]
GAMA is a general-purpose Large Audio-Language Model (LALM) with advanced audio understanding and complex reasoning abilities.
We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former.
We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities.
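As a rough illustration of the Audio Q-Former idea referenced above, the sketch below shows learned query tokens cross-attending over variable-length audio encoder features to produce a fixed set of embeddings for an LLM. Module names, layer sizes, and the projection are illustrative assumptions, not GAMA's actual architecture.

```python
import torch
import torch.nn as nn

class AudioQFormerSketch(nn.Module):
    """Illustrative stand-in: learned queries pool variable-length audio
    features into a fixed number of tokens an LLM can consume. Not GAMA's
    real code; sizes are placeholder choices."""
    def __init__(self, d_model=768, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)  # map into the LLM embedding space

    def forward(self, audio_feats):  # audio_feats: (B, T, d_model)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(pooled)     # (B, n_queries, d_model)
```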
arXiv Detail & Related papers (2024-06-17T17:31:01Z) - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z) - Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z) - SALMONN: Towards Generic Hearing Abilities for Large Language Models [24.73033723114979]
We propose SALMONN, a Speech Audio Language Music Open Neural Network.
It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model.
It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z) - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
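One way to picture "complementing LLMs with foundation models" is an LLM-as-controller pattern that routes each request to a specialized audio model. The sketch below illustrates that pattern only; the tool names and the `llm_choose` hook are hypothetical, not AudioGPT's actual interface.

```python
# Minimal sketch of an LLM-as-controller routing pattern.
# Tool names and `llm_choose` are hypothetical, not AudioGPT's API.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "transcribe": lambda path: f"[transcript of {path}]",             # e.g. an ASR model
    "generate_speech": lambda text: f"[speech audio for {text!r}]",   # e.g. a TTS model
    "separate_sources": lambda path: f"[separated stems of {path}]",  # source separation
}

def handle(request: str, argument: str, llm_choose) -> str:
    """Ask the LLM which audio tool fits the request, then run it."""
    tool_name = llm_choose(request, list(TOOLS))  # LLM picks a tool by name
    return TOOLS[tool_name](argument)
```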
arXiv Detail & Related papers (2023-04-25T17:05:38Z) - MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
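Read as a contrastive selection problem, the masked-snippet objective asks the representation at the MASK position to score the true snippet above distractors. A minimal sketch under that reading, assuming PyTorch, follows; shapes and the temperature are illustrative, not MERLOT Reserve's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_snippet_loss(mask_repr, candidates, target_idx, temperature=0.07):
    """Contrastive selection: the joint representation at a MASKed position
    should score the true text/audio snippet above distractor snippets.
    Shapes and temperature are illustrative, not the paper's exact setup."""
    mask_repr = F.normalize(mask_repr, dim=-1)    # (B, D)
    candidates = F.normalize(candidates, dim=-1)  # (B, N, D): 1 correct, N-1 distractors
    logits = torch.einsum("bd,bnd->bn", mask_repr, candidates) / temperature
    return F.cross_entropy(logits, target_idx)    # target_idx: (B,) correct indices
```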
arXiv Detail & Related papers (2022-01-07T19:00:21Z) - Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model [51.42415340921237]
We propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to fuse the two modalities (audio and textual).
We further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio.
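If the MKD module resembles standard logit distillation, a unimodal (text-only or audio-only) student can be trained to match the multimodal teacher's softened answer distribution while still fitting the gold answers. The sketch below shows that generic recipe; the loss form and hyperparameters are assumptions, not necessarily the paper's exact module.

```python
import torch
import torch.nn.functional as F

def mkd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Generic knowledge-distillation loss: a unimodal student mimics a
    multimodal teacher's softened answer distribution while still fitting
    the gold answers. Temperature and mixing weight are illustrative."""
    soft_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, targets)
    return alpha * distill + (1 - alpha) * hard
```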
arXiv Detail & Related papers (2021-07-04T08:35:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.