WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
- URL: http://arxiv.org/abs/2502.14727v1
- Date: Thu, 20 Feb 2025 16:54:07 GMT
- Title: WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
- Authors: Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao
- Abstract summary: WavRAG is the first retrieval augmented generation framework with native, end-to-end audio support.
We propose the WavRetriever to facilitate retrieval from a text-audio hybrid knowledge base.
In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration.
- Abstract: Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG's unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
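As an illustration of the hybrid retrieval idea in the abstract, the sketch below embeds text passages and raw audio clips into a single shared vector space and ranks them against a spoken query by cosine similarity. The encoders here are random-vector stubs standing in for a real shared-space embedder such as WavRetriever; the dimension, function names, and data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical encoders standing in for a shared-space multimodal
# embedder; here they are stubs that return random unit vectors.
DIM = 512
rng = np.random.default_rng(0)

def embed_text(text: str) -> np.ndarray:
    """Placeholder text encoder: one unit vector per passage."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder audio encoder: embeds raw samples, no ASR step."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def retrieve(query_vec: np.ndarray, index: np.ndarray, k: int = 3):
    """Rank a unified text+audio index by cosine similarity."""
    scores = index @ query_vec          # rows are unit vectors
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Build one index over both modalities: text passages and audio clips
# live side by side, so a single query can hit either kind of entry.
entries = [embed_text("passage about jazz history"),
           embed_audio(rng.standard_normal(16000)),   # 1 s of fake audio
           embed_text("passage about speech codecs")]
index = np.stack(entries)

query = embed_audio(rng.standard_normal(16000))       # spoken query
ids, scores = retrieve(query, index)
print(ids, scores)
```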
Related papers
- MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition [23.406334722946163]
MoHAVE (Mixture of Hierarchical Audio-Visual Experts) is a novel robust AVSR framework designed to address scalability constraints.
MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to varied audio-visual inputs with minimal computational overhead; a toy gating sketch follows this entry.
arXiv Detail & Related papers (2025-02-11T11:01:05Z)
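The expert-group routing MoHAVE describes can be pictured with a toy mixture-of-experts layer: a router scores experts for each fused audio-visual feature and only the top-k expert MLPs run. Everything below (the sizes, top-k of 2, and a flat expert list rather than MoHAVE's hierarchical modality groups) is a simplified assumption for illustration.

```python
import torch
import torch.nn.functional as F

DIM, N_EXPERTS, TOP_K = 256, 4, 2

# One small MLP per expert; grouping these by modality is the idea
# MoHAVE builds on, omitted here for brevity.
experts = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(DIM, DIM), torch.nn.GELU())
     for _ in range(N_EXPERTS)]
)
router = torch.nn.Linear(DIM, N_EXPERTS)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """Route each input vector to its top-k experts and mix outputs."""
    gate = F.softmax(router(x), dim=-1)           # (batch, N_EXPERTS)
    weights, idx = gate.topk(TOP_K, dim=-1)       # sparse selection
    weights = weights / weights.sum(-1, keepdim=True)
    out = torch.zeros_like(x)
    for slot in range(TOP_K):
        for e in range(N_EXPERTS):
            mask = idx[:, slot] == e              # inputs routed to e
            if mask.any():
                out[mask] += weights[mask, slot, None] * experts[e](x[mask])
    return out

fused_av = torch.randn(8, DIM)      # stand-in fused audio-visual features
print(moe_forward(fused_av).shape)  # torch.Size([8, 256])
```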
- Speech Retrieval-Augmented Generation without Automatic Speech Recognition [4.731446054087683]
SpeechRAG is a novel framework designed for open-question answering over spoken data.
Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter whose outputs feed a frozen large language model.
By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries; a toy sketch of this alignment step follows this entry.
arXiv Detail & Related papers (2024-12-21T06:16:04Z)
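The alignment step mentioned above can be sketched as training a small adapter that projects frozen speech-encoder outputs into a frozen text-embedding space with a symmetric contrastive (InfoNCE-style) loss, so that text queries and audio passages become directly comparable. The shapes, the single linear adapter, and the temperature below are illustrative assumptions, not SpeechRAG's actual architecture.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a frozen text embedder produces 768-d vectors,
# a frozen speech encoder produces 1024-d vectors per utterance.
TEXT_DIM, SPEECH_DIM, BATCH = 768, 1024, 8

adapter = torch.nn.Linear(SPEECH_DIM, TEXT_DIM)   # the only trained part
opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)

def info_nce(speech_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss pulling paired speech/text together."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                # (BATCH, BATCH)
    labels = torch.arange(len(s))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# One training step on fake paired features (stand-ins for frozen
# encoder outputs over matched transcript/audio pairs).
speech_feats = torch.randn(BATCH, SPEECH_DIM)
text_feats = torch.randn(BATCH, TEXT_DIM)
loss = info_nce(adapter(speech_feats), text_feats)
loss.backward()
opt.step()
```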
- Audio Captioning RAG via Generative Pair-to-Pair Retrieval with Refined Knowledge Base [0.0]
Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base and augments them with query audio to generate accurate textual responses.
We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find relevant audio-text pairs; a minimal sketch of this two-step lookup follows this entry.
Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD.
arXiv Detail & Related papers (2024-10-14T04:57:32Z)
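Generative pair-to-pair retrieval can be illustrated in a few lines: first draft a caption for the query audio with a captioning model, then match that draft against the stored captions of audio-text pairs. The stub generator and token-overlap scorer below are deliberate simplifications; a real system would use a learned text retriever rather than Jaccard overlap.

```python
# A minimal sketch of pair-to-pair retrieval: generate a draft caption
# for the query audio, then match it against stored captions by simple
# token overlap. `draft_caption` is a stub for a real captioning model.
def draft_caption(audio) -> str:
    return "a dog barking near a busy street"      # stand-in output

knowledge_base = [
    ("clip_001.wav", "dog barking in the distance"),
    ("clip_002.wav", "rain falling on a metal roof"),
    ("clip_003.wav", "traffic noise on a busy street"),
]

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two captions, in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

query_text = draft_caption(audio=None)
ranked = sorted(knowledge_base,
                key=lambda pair: jaccard(query_text, pair[1]),
                reverse=True)
print(ranked[0])   # best-matching audio-text pair for augmentation
```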
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
This paper introduces speculative speech recognition (SSR).
We propose a model which performs SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
arXiv Detail & Related papers (2024-07-05T16:52:55Z)
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks covering tasks such as ASR, ST, SV, and ER, and further apply it to specialized datasets such as the Gaokao English listening comprehension set for SQA and a speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
- One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data [61.89520860387473]
We propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, for task-oriented dialogue (ToD) systems.
KA2G achieves robust and data-efficient slot filling for speech-based ToD by 1) framing it as a text generation task, 2) grounding text generation additionally in the audio modality, and 3) conditioning on available external knowledge.
Experiments, conducted on the standard speech-based single-turn SLURP dataset and a multi-turn dataset extracted from a commercial ToD system, display strong and consistent gains.
arXiv Detail & Related papers (2023-07-04T15:05:42Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied field of automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)