Related papers: SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation

SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation

URL: http://arxiv.org/abs/2502.02603v1
Date: Sun, 26 Jan 2025 15:04:02 GMT
Title: SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation
Authors: Chunyu Sun, Bingyu Liu, Zhichao Cui, Anbin Qi, Tian-hao Zhang, Dinghao Zhou, Lewei Lu,
Abstract summary: We propose a unified embedding framework that eliminates the need for intermediate text representations.<n>Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
Score: 10.828717295018123
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language models (LLMs) applications. However, when it comes to speech larage language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50\% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a theoretical analysis of the challenges inherent in end-to-end speech retrieval and introduce architectural principles for effective speech-to-document matching. Extensive experiments demonstrate the robustness of our approach across diverse acoustic conditions and speaker variations, paving the way for a new paradigm in multimodal SLLMs retrieval systems.

Related papers

OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment [48.17593420058064]
We present OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling.<n>Our approach processes speech and text representations parallel while enhancing audio comprehension through contrastive alignment.
arXiv Detail & Related papers (2025-06-11T02:57:22Z)
Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model.<n>The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment [15.899112804399193]
We present textbfTESU-LLM, a novel framework that enables training speech-capable language models using only text data.<n>Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space.<n>Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks.
arXiv Detail & Related papers (2025-06-01T09:27:55Z)
CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models [23.278483193586887]
We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task.<n>Our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM.
arXiv Detail & Related papers (2025-05-31T07:26:44Z)
GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture [12.303324248639266]
We propose a text-to-speech generation approach optimized via a novel dual-branch ArchiTecture (GOAT-TTS) GOAT-TTS combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models.
arXiv Detail & Related papers (2025-04-15T01:44:56Z)
Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problem for automatic speech recognition (ASR) In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging pre-trained speech encoder and LLM. Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
Multi-Modal Retrieval For Large Language Model Based Speech Recognition [15.494654232953678]
We propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We show that speech-based multi-modal retrieval outperforms text based retrieval. We achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
arXiv Detail & Related papers (2024-06-13T22:55:22Z)
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing. Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy. We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space. The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction and Constrained Decoding [65.34601470417967]
We describe a hybrid architecture for dialogue response generation that combines the strengths of neural language modeling and rule-based generation. Our experiments show that this system outperforms both rule-based and learned approaches in human evaluations of fluency, relevance, and truthfulness.
arXiv Detail & Related papers (2022-09-16T09:00:49Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions. Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.