Audiopedia: Audio QA with Knowledge
- URL: http://arxiv.org/abs/2412.20619v1
- Date: Sun, 29 Dec 2024 23:48:35 GMT
- Title: Audiopedia: Audio QA with Knowledge
- Authors: Abhirama Subramanyam Penamakuri, Kiran Chhatre, Akshat Jain
- Abstract summary: We introduce Audiopedia, a novel task called Audio Question Answering with Knowledge.
Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions.
We benchmark large audio language models (LALMs) on these sub-tasks and observe suboptimal performance.
We propose a generic framework that can be adapted to any LALM, equipping them with knowledge reasoning capabilities.
- Abstract: In this paper, we introduce Audiopedia, a novel task called Audio Question Answering with Knowledge, which requires both audio comprehension and external knowledge reasoning. Unlike traditional Audio Question Answering (AQA) benchmarks that focus on simple queries answerable from audio alone, Audiopedia targets knowledge-intensive questions. We define three sub-tasks: (i) Single Audio Question Answering (s-AQA), where questions are answered based on a single audio sample, (ii) Multi-Audio Question Answering (m-AQA), which requires reasoning over multiple audio samples, and (iii) Retrieval-Augmented Audio Question Answering (r-AQA), which involves retrieving relevant audio to answer the question. We benchmark large audio language models (LALMs) on these sub-tasks and observe suboptimal performance. To address this, we propose a generic framework that can be adapted to any LALM, equipping them with knowledge reasoning capabilities. Our framework has two components: (i) Audio Entity Linking (AEL) and (ii) Knowledge-Augmented Audio Large Multimodal Model (KA2LM), which together improve performance on knowledge-intensive AQA tasks. To our knowledge, this is the first work to address advanced audio understanding via knowledge-intensive tasks like Audiopedia.
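As a concrete (purely illustrative) reading of the framework, the sketch below wires the two components together in Python; every name in it, including the toy knowledge base, is an assumption rather than an interface from the paper.

```python
# Hypothetical sketch of the AEL -> KA2LM pipeline; all function names
# and the toy knowledge base are invented for illustration.
from dataclasses import dataclass

@dataclass
class AudioSample:
    path: str
    tags: list[str]  # stand-in for an audio tagger's output

def audio_entity_link(audio: AudioSample) -> list[str]:
    # AEL: ground sounds in the clip to named entities in a knowledge
    # base (e.g. a distinctive call -> "common loon"). Here pre-computed
    # tags simply pass through.
    return audio.tags

def retrieve_knowledge(entities: list[str], kb: dict[str, str]) -> str:
    # Pull external facts that cannot be inferred from the audio alone.
    return " ".join(kb[e] for e in entities if e in kb)

def call_lalm(prompt: str, audios: list[AudioSample]) -> str:
    # Stand-in for any large audio language model backend.
    return f"[answer conditioned on {len(audios)} clip(s) and {prompt!r}]"

def ka2lm_answer(question: str, audios: list[AudioSample], kb: dict[str, str]) -> str:
    # KA2LM: condition the LALM on audio, question, and retrieved knowledge.
    # s-AQA passes one clip and m-AQA several; for r-AQA the `audios` list
    # would itself come from a retrieval step over an audio corpus.
    entities = [e for a in audios for e in audio_entity_link(a)]
    prompt = f"Knowledge: {retrieve_knowledge(entities, kb)}\nQuestion: {question}"
    return call_lalm(prompt, audios)

kb = {"common loon": "The common loon breeds on lakes across Canada."}
clip = AudioSample("lake.wav", tags=["common loon"])
print(ka2lm_answer("Where does this bird breed?", [clip], kb))
```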
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
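The abstract does not spell out the token mechanism; the minimal sketch below shows one plausible reading of "source-wise learnable tokens", with all dimensions and wiring being assumptions rather than the released SaSR-Net:

```python
# Assumed sketch: one learnable query per latent sound source attends
# over fused audio-visual features. Sizes are illustrative.
import torch
import torch.nn as nn

class SourceTokens(nn.Module):
    def __init__(self, n_sources: int = 4, dim: int = 256):
        super().__init__()
        # One learnable query vector per latent sound source.
        self.tokens = nn.Parameter(torch.randn(n_sources, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, av_feats: torch.Tensor) -> torch.Tensor:
        # av_feats: (B, T, dim) fused audio-visual features.
        q = self.tokens.unsqueeze(0).expand(av_feats.size(0), -1, -1)
        # Each token attends to the frames it best explains, yielding one
        # summary vector per source that can be matched to the question.
        out, _ = self.attn(q, av_feats, av_feats)
        return out  # (B, n_sources, dim)

print(SourceTokens()(torch.randn(2, 50, 256)).shape)  # torch.Size([2, 4, 256])
```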
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
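The MALLM internals are not given in the abstract; the sketch below shows a generic pattern such a model could use to expose several clips to one LLM, with learned separator embeddings marking clip boundaries (all names and sizes are assumptions):

```python
# Assumed sketch of packing several clips into one LLM context; names
# and sizes are illustrative, not the MALLM architecture.
import torch
import torch.nn as nn

class MultiAudioPacker(nn.Module):
    def __init__(self, audio_dim: int = 128, llm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)         # audio -> LLM space
        self.sep = nn.Parameter(torch.randn(1, llm_dim))  # learned clip separator

    def forward(self, clips: list[torch.Tensor]) -> torch.Tensor:
        pieces = []
        for c in clips:               # each clip: (T_i, audio_dim) features
            pieces.append(self.proj(c))
            pieces.append(self.sep)   # mark the clip boundary for the LLM
        # One sequence the LLM can attend over to compare clips directly.
        return torch.cat(pieces, dim=0)

packed = MultiAudioPacker()([torch.randn(20, 128), torch.randn(35, 128)])
print(packed.shape)  # torch.Size([57, 512])
```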
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning.
LALMs excel in general audio understanding, but are limited in temporal reasoning.
This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z)
- GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities [43.23351906406144]
GAMA is a general-purpose Large Audio-Language Model (LALM) with advanced audio understanding and complex reasoning abilities.
We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former.
We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities.
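As a rough illustration of the Q-Former idea referenced above, where a fixed set of learnable queries distills variable-length audio features into a few soft "audio tokens" for the LLM (sizes are placeholders, not GAMA's configuration):

```python
# Rough sketch of a Q-Former-style audio adapter; sizes are placeholders,
# not GAMA's actual configuration.
import torch
import torch.nn as nn

class AudioQFormer(nn.Module):
    def __init__(self, n_queries: int = 32, dim: int = 768):
        super().__init__()
        # A fixed set of learnable queries, independent of audio length.
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(dim, dim)  # project into the LLM embedding space

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, dim) from a (typically frozen) audio encoder.
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        distilled, _ = self.attn(q, audio_feats, audio_feats)
        return self.to_llm(distilled)  # (B, n_queries, dim): soft "audio tokens"

print(AudioQFormer()(torch.randn(2, 300, 768)).shape)  # torch.Size([2, 32, 768])
```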
arXiv Detail & Related papers (2024-06-17T17:31:01Z)
- Answering Diverse Questions via Text Attached with Key Audio-Visual Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference.
We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z)
- CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components.
We introduce the CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways.
CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z)
- Separate Anything You Describe [53.30484933564858]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
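A toy sketch of the LASS interface follows: a text query conditions a mask over the mixture spectrogram. The FiLM-style conditioning and all shapes are assumptions for illustration, not AudioSep's actual architecture:

```python
# Toy text-queried separator; FiLM-style conditioning and shapes are
# assumptions for illustration, not AudioSep's architecture.
import torch
import torch.nn as nn

class TextQueriedSeparator(nn.Module):
    def __init__(self, n_freq: int = 257, text_dim: int = 512):
        super().__init__()
        self.film = nn.Linear(text_dim, 2 * n_freq)  # per-bin scale and shift
        self.mask_net = nn.Sequential(nn.Linear(n_freq, n_freq), nn.Sigmoid())

    def forward(self, mix_spec: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # mix_spec: (B, T, n_freq) magnitude spectrogram of the mixture.
        # text_emb: (B, text_dim) embedding of a query like "a dog barking".
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        conditioned = mix_spec * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        mask = self.mask_net(conditioned)  # (B, T, n_freq) values in [0, 1]
        return mask * mix_spec             # estimate of the queried source

out = TextQueriedSeparator()(torch.randn(2, 100, 257).abs(), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 100, 257])
```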
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
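The summary describes a dispatcher pattern, with the LLM routing audio work to specialist foundation models; the toy routing below is invented purely to show the shape of that idea:

```python
# Toy illustration of the dispatcher pattern described above: an LLM
# decides which specialist audio model handles each request. The task
# names and tool registry here are invented, not AudioGPT's actual API.
TOOLS = {
    "transcribe": "speech recognition model",
    "separate": "source separation model",
    "synthesize": "text-to-audio generation model",
}

def route(task: str) -> str:
    # In AudioGPT the LLM itself performs this selection from the user's
    # natural-language request; a dict lookup stands in for that step.
    return TOOLS.get(task, "answer directly with the LLM")

for task in ("transcribe", "synthesize", "chat"):
    print(f"{task} -> {route(task)}")
```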
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception, and our model outperforms recent audio-only (A-QA), visual-only (V-QA), and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)