IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
- URL: http://arxiv.org/abs/2505.16774v1
- Date: Thu, 22 May 2025 15:15:29 GMT
- Title: IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
- Authors: Yiming Gao, Bin Wang, Chengwei Wei, Shuo Sun, AiTi Aw
- Abstract summary: IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions.
- Score: 18.11667976818302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess instruction-following ability in audio LLMs. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.
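The six dimensions listed above are rule-verifiable, so outputs can in principle be scored without a judge model. Below is a minimal Python sketch of how such rule-based checking might look; the field names (audio_path, instruction, dimension) and the individual rules are illustrative assumptions, not the released dataset's actual schema or checkers.

```python
# Hypothetical sketch of IFEval-style rule-based scoring for
# audio-instruction-answer triples. Field names and rules are
# assumptions, not the released IFEval-Audio schema.

def check_capitalization(response: str) -> bool:
    # Example rule: the whole response must be uppercase.
    return response == response.upper() and any(c.isalpha() for c in response)

def check_length(response: str, max_words: int = 50) -> bool:
    # Example rule: respond in at most `max_words` words.
    return len(response.split()) <= max_words

def check_list_structure(response: str, n_items: int = 3) -> bool:
    # Example rule: exactly `n_items` bullet lines starting with "- ".
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return len(lines) == n_items and all(ln.lstrip().startswith("- ") for ln in lines)

CHECKERS = {
    "Capitalization": check_capitalization,
    "Length": check_length,
    "List Structure": check_list_structure,
}

def evaluate(examples, generate):
    # `generate(audio_path, instruction)` stands in for the audio LLM call.
    correct, total = {}, {}
    for ex in examples:
        dim = ex["dimension"]
        response = generate(ex["audio_path"], ex["instruction"])
        total[dim] = total.get(dim, 0) + 1
        correct[dim] = correct.get(dim, 0) + int(CHECKERS[dim](response))
    return {dim: correct[dim] / total[dim] for dim in total}

if __name__ == "__main__":
    triples = [{"audio_path": "clip_001.wav",
                "dimension": "Capitalization",
                "instruction": "Transcribe the speech in all capital letters."}]
    print(evaluate(triples, lambda audio, instr: "HELLO WORLD"))
```

A dummy `generate` is used above; plugging in an actual audio LLM call yields per-dimension instruction-following accuracy.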
Related papers
- DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment [94.0709779805955]
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM). It is designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks.
arXiv Detail & Related papers (2025-07-03T16:28:25Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context [45.56363286769136]
We introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. We propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering.
arXiv Detail & Related papers (2025-03-19T15:34:21Z) - AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
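As a toy illustration of this pairing idea (not the paper's actual pipeline or operations), an augmentation can be applied jointly to the waveform and its caption:

```python
# Toy illustration of caption-aligned audio augmentation in the spirit of
# AudioSetMix; the transforms and caption templates here are assumptions.
import random
import numpy as np

def change_volume(wave: np.ndarray, gain: float) -> np.ndarray:
    # Scale amplitude and clip to the valid [-1, 1] range.
    return np.clip(wave * gain, -1.0, 1.0)

# Each entry pairs a signal-processing operation with a caption rewrite.
AUGMENTATIONS = [
    (lambda w: change_volume(w, 0.3), "{caption}, played quietly"),
    (lambda w: change_volume(w, 2.0), "{caption}, played loudly"),
    (lambda w: np.ascontiguousarray(w[::-1]), "{caption}, played in reverse"),
]

def augment_pair(wave: np.ndarray, caption: str):
    transform, template = random.choice(AUGMENTATIONS)
    return transform(wave), template.format(caption=caption)

if __name__ == "__main__":
    wave = np.zeros(16000, dtype=np.float32)  # 1 s of audio at 16 kHz
    _, new_caption = augment_pair(wave, "a dog barking")
    print(new_caption)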
arXiv Detail & Related papers (2024-05-17T21:08:58Z) - LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that state-of-the-art models, such as AudioLDM, are biased in their generation performance.
We propose a simple retrieval-augmented approach for text-to-audio (TTA) models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
arXiv Detail & Related papers (2023-09-14T22:35:39Z) - Separate Anything You Describe [53.30484933564858]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)