MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
- URL: http://arxiv.org/abs/2505.13032v1
- Date: Mon, 19 May 2025 12:18:42 GMT
- Title: MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
- Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, Ruibin Yuan, Zhisheng Zheng, Ziya Zhou, Haina Zhu, Wei Xue, Emmanouil Benetos, Kai Yu, Eng-Siong Chng, Xie Chen
- Abstract summary: MMAR comprises 1,000 meticulously curated audio-question-answer triplets. MMAR extends existing benchmarks to a broad spectrum of real-world audio scenarios. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs).
- Score: 50.71803775663387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across a massive range of multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, some of the questions require graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations in the understanding and reasoning capabilities of current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.
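To make the benchmark's structure concrete, the sketch below shows one plausible way to represent an MMAR-style audio-question-answer item (with its reasoning layer, sub-category, and CoT rationale) and to score model predictions by exact match. This is a minimal illustration: the field names (audio_path, question, choices, answer, layer, sub_category, cot_rationale), the multiple-choice assumption, and the exact-match metric are hypothetical and are not the official MMAR data format or evaluation protocol.

```python
# Minimal, hypothetical sketch of an MMAR-style item and an exact-match scorer.
# All field names and the multiple-choice assumption are illustrative only;
# consult the official MMAR release for the real schema and evaluation code.
from dataclasses import dataclass
from typing import List


@dataclass
class MMARItem:
    audio_path: str       # real-world audio clip (speech, sound, music, or a mix)
    question: str         # question requiring multi-step reasoning over the audio
    choices: List[str]    # candidate answers (assumed multiple-choice here)
    answer: str           # gold answer
    layer: str            # "Signal", "Perception", "Semantic", or "Cultural"
    sub_category: str     # finer-grained task label within the layer
    cot_rationale: str    # annotated Chain-of-Thought rationale


def exact_match_accuracy(items: List[MMARItem], predictions: List[str]) -> float:
    """Fraction of predictions that exactly match the gold answer (case-insensitive)."""
    if not items:
        return 0.0
    correct = sum(pred.strip().lower() == item.answer.strip().lower()
                  for item, pred in zip(items, predictions))
    return correct / len(items)


if __name__ == "__main__":
    demo = [MMARItem(
        audio_path="clip_001.wav",
        question="Which instrument enters immediately after the spoken introduction?",
        choices=["violin", "piano", "trumpet"],
        answer="piano",
        layer="Perception",
        sub_category="music",
        cot_rationale="The speech ends, then a keyboard timbre with hammered attacks begins ...",
    )]
    print(exact_match_accuracy(demo, ["piano"]))  # -> 1.0
```

Per-layer scores follow the same pattern by grouping items on the `layer` field, mirroring the hierarchical categorization described in the abstract.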
Related papers
- Chain of Questions: Guiding Multimodal Curiosity in Language Models [2.0180882714261568]
Chain of Questions (CoQ) is a curiosity-driven reasoning approach that encourages multimodal language models to generate targeted questions regarding their surroundings. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets.
arXiv Detail & Related papers (2025-08-06T11:42:54Z)
- MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning [10.602434753538535]
The ability to process information from multiple modalities and to reason through it step by step remains a critical challenge in advancing artificial intelligence. Here, we present MARBLE, a challenging multimodal reasoning benchmark designed to scrutinize multimodal language models. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube.
arXiv Detail & Related papers (2025-06-28T19:44:32Z)
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores for the QA task on our proposed AVHaystacks benchmark.
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
- SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information [44.99833362998488]
Large audio-language models (LALMs) extend large language models with multimodal understanding in speech, audio, etc. While their performance on speech- and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. We introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly.
arXiv Detail & Related papers (2025-05-19T15:20:32Z)
- MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models [50.43793764203352]
We introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions.
arXiv Detail & Related papers (2025-04-08T08:06:53Z)
- URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models [8.882948576463244]
We propose URO-Bench, an extensive benchmark for spoken dialogue models (SDMs). URO-Bench is the first S2S benchmark that covers evaluations of multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels, a basic track and a pro track, consisting of 16 and 20 datasets respectively.
arXiv Detail & Related papers (2025-02-25T03:31:48Z)
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
- VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models [32.086847480051084]
We present VoxEval, a novel SpeechQA benchmark that assesses knowledge understanding through pure speech interactions. Our benchmark 1) maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format.
arXiv Detail & Related papers (2025-01-09T04:30:12Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically reviews the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- Recent Advances in Hate Speech Moderation: Multimodality and the Role of Large Models [52.24001776263608]
This comprehensive survey delves into the recent strides in hate speech (HS) moderation.
We highlight the burgeoning role of large language models (LLMs) and large multimodal models (LMMs).
We identify existing gaps in research, particularly in the context of underrepresented languages and cultures.
arXiv Detail & Related papers (2024-01-30T03:51:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.