SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models
- URL: http://arxiv.org/abs/2506.12935v1
- Date: Sun, 15 Jun 2025 18:26:08 GMT
- Authors: Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui
- Abstract summary: We introduce the Audio Logical Reasoning dataset, consisting of 6,446 text-audio annotated samples. We then propose SoundMind, a rule-based reinforcement learning algorithm tailored to endow ALMs with deep bimodal reasoning abilities. Our approach achieves state-of-the-art performance in audio logical reasoning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models have shown reasoning capabilities, their application to the audio modality, particularly in large audio-language models (ALMs), remains significantly underdeveloped. Addressing this gap requires a systematic approach, involving a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this study, we present a comprehensive solution: we introduce the Audio Logical Reasoning (ALR) dataset, consisting of 6,446 text-audio annotated samples specifically designed for complex reasoning tasks. Building on this resource, we propose SoundMind, a rule-based reinforcement learning (RL) algorithm tailored to endow ALMs with deep bimodal reasoning abilities. By training Qwen2.5-Omni-7B on the ALR dataset using SoundMind, our approach achieves state-of-the-art performance in audio logical reasoning. This work highlights the impact of combining high-quality, reasoning-focused datasets with specialized RL techniques, advancing the frontier of auditory intelligence in language models. Our code and the proposed dataset are available at https://github.com/xid32/SoundMind.
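The abstract describes SoundMind as a rule-based RL algorithm, i.e. one whose reward is computed from verifiable rules rather than a learned reward model. The paper does not spell out its reward rules here, so the following is only a hypothetical sketch of what a rule-based reward for logical reasoning commonly looks like (a format bonus plus an answer-accuracy bonus); the function name, tag conventions, and weights are illustrative assumptions, not SoundMind's actual implementation.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward: a format bonus when the response
    wraps its reasoning in <think>...</think> followed by <answer>...</answer>,
    plus an accuracy bonus when the extracted answer matches the gold label."""
    reward = 0.0
    # Format rule: reasoning tags must precede an explicit answer tag.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL):
        reward += 0.5
    # Accuracy rule: compare the extracted answer to the gold label.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip().lower() == gold_answer.strip().lower():
        reward += 1.0
    return reward
```

Because every term of such a reward is checkable by string rules, it needs no reward model and cannot be gamed by fluent-but-wrong text, which is the usual motivation for rule-based RL in reasoning tasks.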
Related papers
- DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment [94.0709779805955]
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM). It is designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks.
arXiv Detail & Related papers (2025-07-03T16:28:25Z) - Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model [85.72664004969182]
We introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence.
arXiv Detail & Related papers (2025-06-10T16:37:39Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
We introduce LISTEN, a contrastive-like training method designed to improve ALLMs' ability to distinguish between present and absent sounds. We also extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [108.73513190593232]
Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet struggle with structured cross-modal reasoning. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs.
arXiv Detail & Related papers (2025-05-07T17:59:49Z) - Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [22.88876323500893]
Reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs). We conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. Our experiments demonstrated state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%.
arXiv Detail & Related papers (2025-03-14T08:43:53Z) - Mellow: a small audio language model for reasoning [31.309253699062307]
Mellow is a small Audio-Language Model specifically designed for reasoning. ReasonAQA is a dataset designed to enhance audio-grounded reasoning in models. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
arXiv Detail & Related papers (2025-03-11T15:29:00Z) - Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [95.45204813682885]
We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We train Audio-Reasoner on CoTA, enabling it to achieve strong logical capabilities in audio reasoning. Our findings underscore the importance of structured CoT training in advancing audio reasoning.
arXiv Detail & Related papers (2025-03-04T06:18:34Z) - Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.