Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
- URL: http://arxiv.org/abs/2409.18680v3
- Date: Wed, 6 Nov 2024 10:27:05 GMT
- Title: Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
- Authors: Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D'Haro, Robby T. Tan, Haizhou Li
- Abstract summary: Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
- Score: 56.776580717999806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that existing ALLMs, while powerful in comprehending the primary audio elements in individual audio inputs, struggle to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) that captures audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs toward the multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.
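As a rough illustration of the data-synthesis idea described in the abstract, the sketch below shows one way discriminative multi-audio training samples could be assembled from existing single-audio captions without human annotation. The AudioClip fields, prompt template, and build_pair_sample helper are assumptions for illustration, not the paper's actual recipe.

```python
# Illustrative sketch only: synthesizing discriminative multi-audio samples
# from single-audio captions, in the spirit of the MALLM training data.
# All names and the prompt format are assumptions, not the authors' pipeline.
import random
from dataclasses import dataclass

@dataclass
class AudioClip:
    path: str      # path to an audio file
    caption: str   # caption from an existing single-audio dataset

def build_pair_sample(clips: list) -> dict:
    """Draw two clips (ideally acoustically similar) and ask the model to tell them apart."""
    a, b = random.sample(clips, 2)
    target = random.choice([a, b])
    prompt = (
        "You are given two audio clips.\n"
        f"Audio 1: <audio>{a.path}</audio>\n"
        f"Audio 2: <audio>{b.path}</audio>\n"
        f'Which audio matches this description: "{target.caption}"?'
    )
    answer = "Audio 1" if target is a else "Audio 2"
    return {"prompt": prompt, "answer": answer}

# Example: build a small synthetic training set without human labels.
clips = [
    AudioClip("dog_bark.wav", "a dog barking in the distance"),
    AudioClip("door_knock.wav", "someone knocking on a wooden door"),
    AudioClip("rain.wav", "steady rain falling on a roof"),
]
dataset = [build_pair_sample(clips) for _ in range(1000)]
```

Pairing similar clips forces the model to attend to the context that distinguishes them, which is the discriminative signal the abstract refers to.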
Related papers
- Audio-Visual Talker Localization in Video for Spatial Sound Reproduction [3.2472293599354596]
In this research, we detect and locate the active speaker in the video.
We find that the two modalities complement each other.
Future investigations will assess the robustness of the model in noisy and highly reverberant environments.
arXiv Detail & Related papers (2024-06-01T16:47:07Z)
- Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
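Sketched below is a minimal late-fusion classifier in the spirit of that description; all feature dimensions and module names are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): late fusion of the ASR 1-best
# hypothesis embedding, ASR decoder signals, and an audio-encoder embedding
# for a binary "device-directed?" decision. Dimensions are assumptions.
import torch
import torch.nn as nn

class DeviceDirectednessClassifier(nn.Module):
    def __init__(self, text_dim=768, signal_dim=8, audio_dim=512, hidden=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + signal_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "addressed to the assistant"
        )

    def forward(self, text_emb, decoder_signals, audio_emb):
        # text_emb: embedding of the ASR 1-best hypothesis
        # decoder_signals: per-utterance ASR statistics (confidence scores, etc.)
        # audio_emb: pooled acoustic representation from an audio encoder
        fused = torch.cat([text_emb, decoder_signals, audio_emb], dim=-1)
        return self.fusion(fused)

model = DeviceDirectednessClassifier()
logit = model(torch.randn(1, 768), torch.randn(1, 8), torch.randn(1, 512))
prob = torch.sigmoid(logit)  # probability the utterance addresses the assistant
```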
arXiv Detail & Related papers (2023-12-06T17:29:03Z)
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
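A minimal sketch of how such clue-guided caption generation might look; the clue fields, prompt wording, and the call_llm helper are hypothetical, not the Auto-ACD pipeline.

```python
# Hypothetical sketch: assembling automatically extracted multi-modal clues
# into a prompt so an LLM can write a congruent audio caption.
def build_caption_prompt(clues: dict) -> str:
    """Turn extracted clues into an instruction for the captioning LLM."""
    return (
        "Write one fluent caption describing only what can be heard.\n"
        f"Detected sound events: {', '.join(clues['audio_tags'])}\n"
        f"Visual context from the paired video frame: {clues['visual_tags']}\n"
        f"Likely scene: {clues['place']}\n"
        "Caption:"
    )

clues = {
    "audio_tags": ["dog barking", "birds chirping"],
    "visual_tags": "a backyard with a wooden fence",
    "place": "residential garden",
}
prompt = build_caption_prompt(clues)
# caption = call_llm(prompt)  # hypothetical LLM call producing the final caption
```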
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
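For intuition only, the sketch below shows the general pattern of classifying audio from a sequence of discrete acoustic codes with a Transformer encoder; the codebook size, class count, model dimensions, and random stand-in codes are assumptions rather than AudioFormer's actual design.

```python
# Illustrative sketch (not the AudioFormer code): embed discrete acoustic codes
# and classify the pooled sequence with a Transformer encoder.
import torch
import torch.nn as nn

class CodeClassifier(nn.Module):
    def __init__(self, codebook_size=1024, num_classes=527, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, codes):            # codes: (batch, frames) integer acoustic codes
        x = self.encoder(self.embed(codes))
        return self.head(x.mean(dim=1))  # pooled clip-level logits

# Toy stand-in: in practice the codes would come from a pretrained audio
# tokenizer/codec rather than random integers.
codes = torch.randint(0, 1024, (2, 500))
logits = CodeClassifier()(codes)
```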
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data [9.072124914105325]
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.
Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model.
arXiv Detail & Related papers (2020-05-29T01:30:14Z)