MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
- URL: http://arxiv.org/abs/2506.04779v1
- Date: Thu, 05 Jun 2025 09:09:36 GMT
- Title: MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
- Authors: Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng
- Abstract summary: MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. We ground our benchmark in linguistic theory, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. MMSU establishes a new standard for comprehensive assessment of spoken language understanding.
- Score: 42.58439306999647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench.
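The benchmark data and evaluation code are public. Below is a minimal sketch of how one might load MMSU with the Hugging Face `datasets` library; the split and field names ("audio", "question", "answer", "task") are assumptions for illustration only, so consult the dataset card at the link above for the actual schema.

```python
# Minimal sketch: load MMSU from the Hugging Face Hub and iterate over its
# audio-question-answer triplets. Split and field names are assumptions.
from datasets import load_dataset

mmsu = load_dataset("ddwang2000/MMSU", split="test")  # split name assumed

for example in mmsu.select(range(3)):
    audio = example["audio"]        # typically a dict with "array" and "sampling_rate"
    question = example["question"]  # assumed field name
    print(example.get("task"), question)
    # To evaluate a SpeechLLM, pass audio["array"] and the question to the
    # model, then compare its selected option against example["answer"].
```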
Related papers
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness [43.67571101152883]
We introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization. We show that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions.
arXiv Detail & Related papers (2025-07-24T06:10:29Z) - BoSS: Beyond-Semantic Speech [43.96461266560891]
Beyond-Semantic Speech (BoSS) refers to the set of information in speech communication that encompasses but transcends explicit semantics. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
arXiv Detail & Related papers (2025-07-23T14:53:50Z) - What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
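For readers unfamiliar with multi-token prediction, the sketch below illustrates the general idea with a toy PyTorch module: k parallel heads map each hidden state to logits over the next k speech tokens. This is a generic illustration under assumed dimensions, not the architecture proposed in the cited paper.

```python
# Toy multi-token prediction (MTP) head: each hidden state predicts the next
# k speech tokens through k parallel linear heads. Dimensions are arbitrary.
import torch
import torch.nn as nn

class MultiTokenSpeechHead(nn.Module):
    def __init__(self, hidden_dim: int, speech_vocab: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, speech_vocab) for _ in range(k)]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        # returns logits of shape (batch, seq_len, k, speech_vocab)
        return torch.stack([head(hidden_states) for head in self.heads], dim=2)

logits = MultiTokenSpeechHead(1024, 4096)(torch.randn(2, 16, 1024))
print(logits.shape)  # torch.Size([2, 16, 4, 4096])
```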
arXiv Detail & Related papers (2025-06-14T15:26:31Z) - VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models [26.34810950257782]
We propose VocalBench, a benchmark designed to evaluate speech interaction models' capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. Experimental results reveal significant variability in current model capabilities, each exhibiting distinct strengths and weaknesses.
arXiv Detail & Related papers (2025-05-21T16:34:07Z) - Language-agnostic, automated assessment of listeners' speech recall using large language models [0.0]
This research leverages modern large language models (LLMs) to assess speech recall in native English speakers and native speakers of 10 other languages. Participants listened to and freely recalled short stories (in quiet/clear conditions and in babble noise) in their native language. LLM prompt engineering combined with semantic similarity analyses to score speech recall revealed sensitivity to known effects of temporal order, primacy/recency, and background noise.
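As a rough illustration of semantic-similarity scoring for free recall (not the authors' exact pipeline, and using an assumed embedding model), one could compare a participant's recall against each story sentence with off-the-shelf multilingual sentence embeddings:

```python
# Illustrative recall scoring via cosine similarity of sentence embeddings.
# Model choice and averaging scheme are assumptions for demonstration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

story_sentences = [
    "The fox crossed the frozen river.",
    "It reached the far bank at dusk.",
]
recall = "A fox went over an icy river and got to the other side in the evening."

story_emb = model.encode(story_sentences, convert_to_tensor=True)
recall_emb = model.encode(recall, convert_to_tensor=True)

# One similarity per story sentence; the mean serves as a crude recall score.
scores = util.cos_sim(recall_emb, story_emb)
print(scores.mean().item())
```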
arXiv Detail & Related papers (2025-03-02T22:28:41Z) - Roadmap towards Superhuman Speech Understanding using Large Language Models [60.57947401837938]
Recent efforts integrate speech and audio data into large language models (LLMs).
Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs.
We propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models.
arXiv Detail & Related papers (2024-10-17T06:44:06Z) - Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models [38.64792118903994]
We evaluate gender bias in SILLMs across four semantic-related tasks.
Our analysis reveals that bias levels are language-dependent and vary with different evaluation methods.
arXiv Detail & Related papers (2024-07-09T15:35:43Z) - An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis [45.558316325252335]
Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning.
We study how the synthesized audio is controlled by the prompt and content.
arXiv Detail & Related papers (2024-03-19T03:22:28Z) - Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
We propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT).
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)