Related papers: Mellow: a small audio language model for reasoning

Mellow: a small audio language model for reasoning

URL: http://arxiv.org/abs/2503.08540v1
Date: Tue, 11 Mar 2025 15:29:00 GMT
Title: Mellow: a small audio language model for reasoning
Authors: Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj,
Abstract summary: Mellow is a small Audio-Language Model specifically designed for reasoning.<n>ReasonAQA is a dataset designed to enhance audio-grounded reasoning in models.<n>Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
Score: 31.309253699062307
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.

Related papers

Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens [62.56027815951259]
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens.<n>This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale.
arXiv Detail & Related papers (2026-02-18T18:32:46Z)
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning [124.19449187588832]
Unified Audio Language Model (UALM) aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model.<n>We first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models.<n>We present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks.
arXiv Detail & Related papers (2025-10-13T22:55:01Z)
AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
multimodal audio-language models (ALMs) take interleaved audio and text as input and output text.<n>AHELM is a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench.<n>We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models [25.143840124269193]
We introduce the Audio Logical Reasoning dataset, consisting of 6,446 text-audio annotated samples.<n>We then propose SoundMind, a rule-based reinforcement learning algorithm tailored to endow ALMs with deep bimodal reasoning abilities.<n>Our approach achieves state-of-the-art performance in audio logical reasoning.
arXiv Detail & Related papers (2025-06-15T18:26:08Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs.<n>These models often hallucinate non-existent sound events, reducing their reliability in real-world applications.<n>We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models [95.45204813682885]
We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks.<n>We train Audio-Reasoner on CoTA, enabling it to achieve great logical capabilities in audio reasoning.<n>Our findings stress the core of structured CoT training in advancing audio reasoning.
arXiv Detail & Related papers (2025-03-04T06:18:34Z)
Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce textbfNexus-O, an industry-level textbfomni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data. We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
ETTA: Elucidating the Design Space of Text-to-Audio Models [33.831803213869605]
We study the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks.<n>We propose our best model dubbed Elucidated Text-To-Audio (ETTA)<n>ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data.
arXiv Detail & Related papers (2024-12-26T21:13:12Z)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities.<n>Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial.<n>We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z)
Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.<n>These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
Salmon: A Suite for Acoustic Language Model Evaluation [20.802090523583196]
We introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response.<n>We evaluate several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method.
arXiv Detail & Related papers (2024-09-11T17:34:52Z)
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format. Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills. Recent developments have enabled the use of more naturalistic training data for computational models. It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.