End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
- URL: http://arxiv.org/abs/2511.09282v1
- Date: Thu, 13 Nov 2025 01:44:23 GMT
- Title: End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
- Authors: Jiliang Hu, Zuchao Li, Baoyuan Qi, Liu Guoming, Ping Wang
- Abstract summary: CLSR is an end-to-end contrastive language-speech retriever. It efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task.
- Score: 33.675277272634666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Following the success of retrieval-augmented generation, a speech-related retriever shows promise in helping preprocess long-form speech. However, the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
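The abstract describes two mechanisms: an intermediate projection of acoustic features into a "text-like" space before alignment, and contrastive training that matches speech segments to questions. A minimal numpy sketch of that idea (not the paper's actual architecture; the encoders, dimensions, and temperature below are hypothetical placeholders) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Normalize rows so cosine similarity reduces to a dot product.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical setup: 5 speech segments paired with 5 text questions.
num_pairs, speech_dim, shared_dim = 5, 32, 16

# Stand-ins for encoder outputs. In CLSR these would come from learned
# speech and text encoders; here they are random placeholders.
speech_feats = rng.normal(size=(num_pairs, speech_dim))
text_embeds = l2_normalize(rng.normal(size=(num_pairs, shared_dim)))

# Intermediate projection: map acoustic features into a "text-like"
# space before alignment, mirroring the extra step the abstract describes.
projection = rng.normal(size=(speech_dim, shared_dim))
speech_embeds = l2_normalize(speech_feats @ projection)

# Contrastive (InfoNCE-style) objective over the similarity matrix:
# each speech segment should match its paired question on the diagonal.
temperature = 0.07
logits = speech_embeds @ text_embeds.T / temperature
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -float(np.mean(np.diag(log_probs)))
print(f"contrastive loss: {loss:.4f}")

# Retrieval at inference time: rank segments for a question by similarity.
query = text_embeds[2]
ranking = np.argsort(-(speech_embeds @ query))
print(f"top-ranked segment for question 2: {ranking[0]}")
```

At training time the projection and encoders would be optimized to pull the diagonal entries of the similarity matrix up and push the off-diagonal entries down; at inference, the same similarity scores rank candidate audio segments against a text query.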
Related papers
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z) - FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing [64.80398769313065]
FastLongSpeech is designed to extend LSLM capabilities for efficient long-speech processing. It incorporates an iterative fusion strategy that can compress excessively long speech sequences into manageable lengths. Our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
arXiv Detail & Related papers (2025-07-20T04:11:06Z) - Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z) - Speech Retrieval-Augmented Generation without Automatic Speech Recognition [4.731446054087683]
SpeechRAG is a novel framework designed for open question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries.
arXiv Detail & Related papers (2024-12-21T06:16:04Z) - VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach using low-rank adaptation (LoRA) of the large language model (LLM) backbone. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z) - Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering [16.613985687431818]
Passage retrieval is a key task in speech-based open-domain QA.
We propose an end-to-end trained multimodal dense retriever that can work directly on spoken questions.
arXiv Detail & Related papers (2024-09-20T13:15:53Z) - AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs [27.122094554340194]
We extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities.
The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation.
arXiv Detail & Related papers (2023-11-12T06:56:14Z) - SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z) - On the Impact of Speech Recognition Errors in Passage Retrieval for Spoken Question Answering [13.013751306590303]
We study the robustness of lexical and dense retrievers against questions with synthetic ASR noise.
We create a new dataset with questions voiced by human users and use their transcriptions to show that the retrieval performance can further degrade when dealing with natural ASR noise instead of synthetic ASR noise.
arXiv Detail & Related papers (2022-09-26T18:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.