Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
- URL: http://arxiv.org/abs/2503.10211v1
- Date: Thu, 13 Mar 2025 09:54:35 GMT
- Title: Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
- Authors: Henglyu Liu, Andong Chen, Kehai Chen, Xuefeng Bai, Meizhi Zhong, Yuan Qiu, Min Zhang
- Abstract summary: We propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within large language models (LLMs). Experimental results on speech translation tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches.
- Score: 20.415410280412697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models (LLMs) have led to significant breakthroughs across various tasks, laying the foundation for the development of LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage the optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we utilize the cross-modal retrieval technique to identify the layers that are best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
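The abstract describes two components: an optimal-transport measure of the speech-text representation gap at individual LLM layers, and cross-modal retrieval to decide which layers are best suited for alignment. The PyTorch sketch below illustrates how an entropy-regularized (Sinkhorn) OT loss over paired hidden states and a retrieval-accuracy layer score could be implemented; it is a minimal sketch under assumptions, not the authors' released code, and all function names, tensor shapes, and hyperparameters (eps, n_iters, pooling) are illustrative.

```python
# Minimal sketch of the two ideas in the abstract:
# (1) a Sinkhorn OT distance between speech and text hidden states at one layer,
# (2) cross-modal retrieval accuracy as a per-layer alignment score.
# Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F


def sinkhorn_ot_loss(speech_h, text_h, eps=0.05, n_iters=50):
    """Entropy-regularized OT distance between two sets of hidden states.

    speech_h: (S, d) hidden states of the speech sequence at one layer
    text_h:   (T, d) hidden states of the paired text at the same layer
    """
    # Cost matrix: 1 - cosine similarity between every speech/text state pair.
    cost = 1.0 - F.normalize(speech_h, dim=-1) @ F.normalize(text_h, dim=-1).T  # (S, T)

    S, T = cost.shape
    mu = torch.full((S,), 1.0 / S, device=cost.device)  # uniform mass over speech states
    nu = torch.full((T,), 1.0 / T, device=cost.device)  # uniform mass over text states

    K = torch.exp(-cost / eps)                           # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                             # Sinkhorn iterations
        v = nu / (K.T @ u + 1e-9)
        u = mu / (K @ v + 1e-9)
    plan = torch.diag(u) @ K @ torch.diag(v)             # (S, T) transport plan
    return (plan * cost).sum()                           # OT distance used as alignment loss


@torch.no_grad()
def retrieval_accuracy(speech_pooled, text_pooled):
    """Speech-to-text retrieval accuracy for a batch of pooled layer representations.

    A proxy for how well a given layer already aligns the two modalities,
    so that the OT loss is applied only at the best-suited layers.
    """
    sim = F.normalize(speech_pooled, dim=-1) @ F.normalize(text_pooled, dim=-1).T  # (B, B)
    pred = sim.argmax(dim=-1)
    return (pred == torch.arange(sim.size(0), device=sim.device)).float().mean().item()
```

Under this reading, the "joint training" mentioned in the abstract would combine the standard speech-translation cross-entropy objective with the OT loss summed over the selected layers, e.g. L = L_ST + lambda * sum_l OT_l, with lambda a weighting hyperparameter; the exact combination used in the paper is not specified here.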
Related papers
- Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels [12.179720395334503]
We propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute.
arXiv Detail & Related papers (2025-03-08T13:28:50Z) - SparQLe: Speech Queries to Text Translation Through LLMs [0.8901073744693314]
This study introduces a novel approach that leverages self-supervised speech representations in combination with instruction-tuned LLMs for speech-to-text translation. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs.
arXiv Detail & Related papers (2025-02-13T12:57:15Z) - AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings. Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z) - SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z) - Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - Mind the Gap: A Generalized Approach for Cross-Modal Embedding Alignment [0.0]
Retrieval-Augmented Generation (RAG) systems often struggle to retrieve context across different text modalities due to semantic gaps.
We introduce a generalized projection-based method, inspired by adapter modules in transfer learning, that efficiently bridges these gaps.
Our approach emphasizes speed, accuracy, and data efficiency, requiring minimal resources for training and inference.
arXiv Detail & Related papers (2024-10-30T20:28:10Z) - Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z) - Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task.
We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus.
Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT further improves state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z) - ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z)