MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
- URL: http://arxiv.org/abs/2409.06635v2
- Date: Mon, 23 Sep 2024 03:04:27 GMT
- Title: MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
- Authors: Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
- Abstract summary: We propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework.
MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size.
Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently fine-tuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
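The abstract describes the architecture only at a high level. As a reading aid, here is a minimal PyTorch sketch of the mixture-of-weak-encoders idea: a (typically frozen) base encoder is supplemented by a pool of small encoders, and a router activates one per input. The module names, the conv-based weak encoders, the sizes, and the top-1 routing rule are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MoWEAudioEncoder(nn.Module):
    """Sketch of a mixture-of-weak-encoders audio front end.

    A (typically frozen) base encoder is supplemented by a pool of small
    encoders; a router picks one per input. All sizes, the conv-based weak
    encoders, and top-1 routing are assumptions for illustration."""

    def __init__(self, base_encoder: nn.Module, dim: int = 512,
                 num_weak: int = 4, weak_dim: int = 128):
        super().__init__()
        self.base_encoder = base_encoder
        # Pool of relatively lightweight encoders (tiny conv stacks here).
        self.weak_encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, weak_dim, kernel_size=3, padding=1),
                nn.GELU(),
                nn.Conv1d(weak_dim, dim, kernel_size=3, padding=1),
            )
            for _ in range(num_weak)
        ])
        # Router scores the weak encoders from time-pooled base features.
        self.router = nn.Linear(dim, num_weak)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, dim) from an acoustic frontend
        base = self.base_encoder(audio_feats)               # (B, T, D)
        idx = self.router(base.mean(dim=1)).argmax(dim=-1)  # top-1 routing
        out = base.clone()
        for i, enc in enumerate(self.weak_encoders):
            mask = idx == i
            if mask.any():                     # residual weak-encoder path
                x = base[mask].transpose(1, 2) # (b, D, T) for Conv1d
                out[mask] = base[mask] + enc(x).transpose(1, 2)
        return out  # features handed to the LLM's adapter/projector
```

For instance, `MoWEAudioEncoder(nn.Identity())` maps a `(batch, time, 512)` tensor to same-shape features; a soft mixture over several weak encoders would be an equally plausible routing choice.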
Related papers
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We also propose a novel multi-audio LLM (MALLM) to capture audio context among multiple similar audio clips.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and achieve new state-of-the-art results for ASR and AVSR with WERs of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization stage.
X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications.
Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
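The mechanism named in this summary (semantic features injected before the residual vector quantization stage) can be made concrete with a short sketch. The fusion-by-concatenation design and all dimensions below are assumptions for illustration, not X-Codec's actual architecture.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal residual vector quantizer (illustrative, not X-Codec's code)."""

    def __init__(self, dim: int = 256, codebook_size: int = 256,
                 num_stages: int = 4):
        super().__init__()
        self.codebooks = nn.ParameterList([
            nn.Parameter(torch.randn(codebook_size, dim))
            for _ in range(num_stages)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); quantize the residual stage by stage
        residual, quantized = x, torch.zeros_like(x)
        for cb in self.codebooks:
            dists = torch.cdist(residual,
                                cb.unsqueeze(0).expand(x.size(0), -1, -1))
            codes = dists.argmin(dim=-1)   # (B, T) nearest codebook entries
            q = cb[codes]                  # (B, T, dim) quantized residual
            quantized, residual = quantized + q, residual - q
        return quantized

def fuse_and_quantize(acoustic: torch.Tensor, semantic: torch.Tensor,
                      proj: nn.Linear, rvq: ResidualVQ) -> torch.Tensor:
    """Concatenate semantic features (e.g., from a pre-trained SSL speech
    model) with acoustic features, then quantize the fused representation."""
    fused = proj(torch.cat([acoustic, semantic], dim=-1))  # (B, T, dim)
    return rvq(fused)
```

Here `proj` would be something like `nn.Linear(acoustic_dim + semantic_dim, dim)`; the key point is that quantization operates on features that already carry semantic content.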
arXiv Detail & Related papers (2024-08-30T10:24:07Z)
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio.
Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking.
We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z)
- Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language.
Recent advancements in large language models (LLMs) have opened up possibilities for improving AAC.
Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
arXiv Detail & Related papers (2024-06-19T07:09:46Z)
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)
- On decoder-only architecture for speech-to-text and large language model integration
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
Large language models (LLMs) enable many interesting capabilities, such as instruction- and chain-of-thought-based fine-tuning.
We adopt the instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation.
Our approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and remains comparable on the rest on the AudioCaps test set.
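The conditioning step this summary describes is straightforward to illustrate: a frozen Flan-T5 encoder produces per-token text embeddings that condition the latent diffusion model via cross-attention. The sketch below covers only that text-encoding step; the checkpoint name is a public Flan-T5 release used here as an assumption (not necessarily the one the paper used), and the diffusion UNet is omitted.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Frozen instruction-tuned text encoder; checkpoint choice is an assumption.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-large").eval()

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Return per-token embeddings that would condition a latent diffusion
    UNet via cross-attention (the diffusion part is omitted here)."""
    batch = tokenizer(prompt, return_tensors="pt")
    return text_encoder(**batch).last_hidden_state  # (1, seq_len, d_model)

cond = encode_prompt("A dog barking in the distance while rain falls")
```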
arXiv Detail & Related papers (2023-04-24T07:45:28Z)
- Automatic Audio Captioning using Attention weighted Event based Embeddings
We propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC.
Our results show that an efficient AED-based embedding extractor combined with temporal attention and augmentation techniques can surpass existing results in the literature.
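For illustration, a lightweight Bi-LSTM with temporal attention over audio-event-detection (AED) embeddings might look like the sketch below; it shows a single attention-pooled prediction step with made-up sizes, not the paper's full captioning decoder.

```python
import torch
import torch.nn as nn

class AttnBiLSTMCaptioner(nn.Module):
    """Sketch of a lightweight AAC decoder: a Bi-LSTM over audio-event
    embeddings with temporal attention (all sizes are illustrative)."""

    def __init__(self, feat_dim: int = 128, hidden: int = 64,
                 vocab: int = 5000):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)     # temporal attention scores
        self.out = nn.Linear(2 * hidden, vocab)  # next-token logits

    def forward(self, aed_embeddings: torch.Tensor) -> torch.Tensor:
        # aed_embeddings: (B, T, feat_dim) from an audio event detector
        h, _ = self.bilstm(aed_embeddings)       # (B, T, 2*hidden)
        w = self.attn(h).softmax(dim=1)          # (B, T, 1) weights over time
        ctx = (w * h).sum(dim=1)                 # (B, 2*hidden) pooled context
        return self.out(ctx)                     # logits for one decode step
```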
arXiv Detail & Related papers (2022-01-28T05:54:19Z)