Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
- URL: http://arxiv.org/abs/2503.02318v1
- Date: Tue, 04 Mar 2025 06:18:34 GMT
- Title: Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
- Authors: Zhifei Xie, Mingbao Lin, Zihang Liu, Pengcheng Wu, Shuicheng Yan, Chunyan Miao
- Abstract summary: We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We train Audio-Reasoner on CoTA, enabling it to achieve great logical capabilities in audio reasoning. Our findings stress the core of structured CoT training in advancing audio reasoning.
- Score: 95.45204813682885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in multimodal reasoning have largely overlooked the audio modality. We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We meticulously curated a large-scale and diverse multi-task audio dataset with simple annotations. Then, we leverage closed-source models to conduct secondary labeling and QA generation, along with a structured CoT process. These datasets together form a high-quality reasoning dataset with 1.2 million reasoning-rich samples, which we name CoTA. Following inference scaling principles, we train Audio-Reasoner on CoTA, enabling it to achieve great logical capabilities in audio reasoning. Experiments show state-of-the-art performance across key benchmarks, including MMAU-mini (+25.42%), AIR-Bench chat/foundation (+14.57%/+10.13%), and MELD (+8.01%). Our findings stress the core of structured CoT training in advancing audio reasoning.
Related papers
- SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models [25.143840124269193]
We introduce the Audio Logical Reasoning dataset, consisting of 6,446 text-audio annotated samples. We then propose SoundMind, a rule-based reinforcement learning algorithm tailored to endow ALMs with deep bimodal reasoning abilities. Our approach achieves state-of-the-art performance in audio logical reasoning.
arXiv Detail & Related papers (2025-06-15T18:26:08Z)
- FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion [14.43138123445589]
High-quality, large-scale audio captioning is crucial for advancing audio understanding. Current automated methods often generate captions that lack fine-grained detail and contextual accuracy. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments.
arXiv Detail & Related papers (2025-06-01T18:29:17Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.84031769492708]
This task defines three QA subsets to test audio-language models on interactive question-answering over diverse acoustic scenes. Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity.
arXiv Detail & Related papers (2025-05-12T09:04:16Z)
- Kimi-Audio Technical Report [67.69331679172303]
Kimi-Audio is an open-source audio foundation model that excels in audio understanding, generation, and conversation.
We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation.
arXiv Detail & Related papers (2025-04-25T15:31:46Z)
- Mellow: a small audio language model for reasoning [31.309253699062307]
Mellow is a small Audio-Language Model specifically designed for reasoning.
ReasonAQA is a dataset designed to enhance audio-grounded reasoning in models.
Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
arXiv Detail & Related papers (2025-03-11T15:29:00Z)
- Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities [72.91296768332163]
We introduce Audio Flamingo 2 (AF2), an Audio-Language Model, and LongAudio, a dataset for training ALMs on long audio captioning and question-answering tasks.
AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks.
For the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks.
arXiv Detail & Related papers (2025-03-06T00:10:26Z)
- Evaluation of Deep Audio Representations for Hearables [1.5646349560044959]
This dataset includes 1,158 audio tracks, each 30 seconds long, created by spatially mixing proprietary monologues with high-quality recordings of everyday acoustic scenes. Our benchmark encompasses eight tasks that assess the general context, speech sources, and technical acoustic properties of the audio scenes. This superiority underscores the advantage of models trained on diverse audio collections, confirming their applicability to a wide array of auditory tasks, including encoding the environment properties necessary for hearable steering.
arXiv Detail & Related papers (2025-02-10T16:51:11Z)
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities [43.23351906406144]
GAMA is a general-purpose Large Audio-Language Model (LALM) with advanced audio understanding and complex reasoning abilities.
We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former.
We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities.
arXiv Detail & Related papers (2024-06-17T17:31:01Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- Separate Anything You Describe [53.30484933564858]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)