LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
- URL: http://arxiv.org/abs/2407.15415v1
- Date: Mon, 22 Jul 2024 06:42:00 GMT
- Title: LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
- Authors: Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura,
- Abstract summary: LLaST is a framework for building high-performance Large Language model based Speech-to-text Translation systems.
Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization.
- Score: 38.60622303744585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduces LLaST, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation(E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Our approach demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework. We release the data, code and models in https://github.com/openaudiolab/LLaST.
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs)
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO)
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - Investigating Decoder-only Large Language Models for Speech-to-text Translation [39.17113782374464]
Large language models (LLMs) are known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains.
We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation.
Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
arXiv Detail & Related papers (2024-07-03T14:42:49Z) - Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning [57.323716555996114]
Off-target translation remains an unsolved problem, especially for low-resource languages.
Recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs.
In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs.
arXiv Detail & Related papers (2024-03-21T13:47:40Z) - Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision.
We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E.
We compare three integration methods between LLMs and speech models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder
arXiv Detail & Related papers (2023-12-30T14:20:04Z) - Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM)
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z) - Tuning Large language model for End-to-end Speech Translation [7.297914077124909]
This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task.
Experimental results on the MuST-C speech translation benchmark demonstrate that LST-13B BLEU scores of 30.39/41.55/35.33 on En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state-of-the-art.
arXiv Detail & Related papers (2023-10-03T13:43:50Z) - mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs [50.17767479660832]
Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
arXiv Detail & Related papers (2023-07-13T17:51:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.