VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering
- URL: http://arxiv.org/abs/2505.17326v1
- Date: Thu, 22 May 2025 22:42:40 GMT
- Title: VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering
- Authors: Zackary Rackauckas, Julia Hirschberg
- Abstract summary: We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system. VoxRAG bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries.
- Score: 4.740589102992697
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval using L2-normalized cosine similarity. We construct a 50-query test set recorded as spoken input by a native English speaker. Retrieval quality was evaluated using LLM-as-a-judge annotations. For very relevant segments, cosine similarity achieved a Recall@10 of 0.34. For somewhat relevant segments, Recall@10 rose to 0.60 and nDCG@10 to 0.27, highlighting strong topical alignment. Answer quality was judged on a 0--2 scale across relevance, accuracy, completeness, and precision, with mean scores of 0.84, 0.58, 0.56, and 0.46 respectively. While precision and retrieval quality remain key limitations, VoxRAG shows that transcription-free speech-to-speech retrieval is feasible in RAG systems.
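The retrieval and evaluation steps named in the abstract can be sketched concretely: embeddings are L2-normalized so that an inner product equals cosine similarity (which is what a FAISS `IndexFlatIP` computes over normalized vectors), and retrieval quality is scored with Recall@k and nDCG@k. The following is a minimal dependency-free sketch, not the paper's implementation; the CLAP embedding step is replaced by plain float vectors, and segment indices stand in for audio segments.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that dot product == cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def retrieve_top_k(query_emb, segment_embs, k=10):
    """Rank segments by cosine similarity to the query.

    Equivalent to exact inner-product search over L2-normalized vectors,
    as done by a FAISS flat inner-product index.
    """
    q = l2_normalize(query_emb)
    scored = [(sum(a * b for a, b in zip(q, l2_normalize(s))), i)
              for i, s in enumerate(segment_embs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant segments that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, graded_relevance, k=10):
    """nDCG@k with graded relevance labels (e.g. 0/1/2 from an LLM judge)."""
    dcg = sum(graded_relevance.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(retrieved[:k]))
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

With real CLAP embeddings, `segment_embs` would hold one vector per audio segment produced by the silence-aware segmentation and diarization stages; any list of equal-length float vectors works here.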
Related papers
- Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling [0.0]
This study assesses five cutting-edge ASR systems' recognition of non-native English accented speech using recordings from the L2-ARCTIC corpus. For read speech, Whisper and AssemblyAI achieved the best accuracy, with mean Match Error Rates (MER) of 0.054 and 0.056 respectively. For spontaneous speech, RevAI performed best with a mean MER of 0.063.
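Match Error Rate (MER), the metric reported above, is the fraction of alignment steps that are errors: MER = (S + D + I) / (H + S + D + I), where H, S, D, and I count hits, substitutions, deletions, and insertions in a minimum-edit word alignment. A small illustrative implementation (not the one used in that study) follows:

```python
def match_error_rate(ref, hyp):
    """MER = (S + D + I) / (H + S + D + I) over a minimum-edit word alignment.

    Since H + S + D equals the reference length R, the denominator is R + I.
    """
    r, h = ref.split(), hyp.split()
    INF = float("inf")
    # dp[i][j] = (min total errors, insertions used) aligning r[:i] with h[:j]
    dp = [[(INF, 0)] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, 0)
    for i in range(len(r) + 1):
        for j in range(len(h) + 1):
            errs, ins = dp[i][j]
            if errs == INF:
                continue
            if i < len(r) and j < len(h):  # hit (cost 0) or substitution (cost 1)
                cost = 0 if r[i] == h[j] else 1
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], (errs + cost, ins))
            if i < len(r):  # deletion of a reference word
                dp[i + 1][j] = min(dp[i + 1][j], (errs + 1, ins))
            if j < len(h):  # insertion of a hypothesis word
                dp[i][j + 1] = min(dp[i][j + 1], (errs + 1, ins + 1))
    errors, insertions = dp[len(r)][len(h)]
    denom = len(r) + insertions
    return errors / denom if denom else 0.0
```

Unlike WER, MER is bounded in [0, 1] because the denominator counts every alignment step, including insertions.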
arXiv Detail & Related papers (2025-03-10T05:09:44Z)
- Speech Retrieval-Augmented Generation without Automatic Speech Recognition [4.731446054087683]
SpeechRAG is a novel framework designed for open-question answering over spoken data. The proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model. By aligning the embedding spaces of text and speech, the speech retriever directly retrieves audio passages from text-based queries.
arXiv Detail & Related papers (2024-12-21T06:16:04Z)
- LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation [15.520180125182756]
Recent advancements in integrating speech information into large language models (LLMs) have significantly improved automatic speech recognition (ASR) accuracy.
Existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents.
We propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR.
arXiv Detail & Related papers (2024-09-13T07:28:47Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes code-switched (CS) data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
Video input is consistently incorporated into both the mask-based MVDR speech separation and the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech mixtures constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Leveraging Cross-Utterance Context For ASR Decoding [6.033324057680156]
Cross-utterance information has been shown to be beneficial during second-pass re-scoring.
We investigate the incorporation of long-context transformer LMs for cross-utterance decoding of acoustic models via beam search.
arXiv Detail & Related papers (2023-06-29T12:48:25Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.