Tuning Large language model for End-to-end Speech Translation
- URL: http://arxiv.org/abs/2310.02050v1
- Date: Tue, 3 Oct 2023 13:43:50 GMT
- Title: Tuning Large language model for End-to-end Speech Translation
- Authors: Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu,
Xiaolin Jiao
- Abstract summary: This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task.
Experimental results on the MuST-C speech translation benchmark demonstrate that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state-of-the-art.
- Score: 7.297914077124909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of large language models (LLMs), multimodal models based
on LLMs have demonstrated significant potential. Models such as LLaSM, X-LLM,
and SpeechGPT exhibit an impressive ability to comprehend and generate human
instructions. However, their performance often falters when faced with complex
tasks like end-to-end speech translation (E2E-ST), a cross-language and
cross-modal translation task. In comparison to single-modal models, multimodal
models lag behind in these scenarios. This paper introduces LST, a Large
multimodal model designed to excel at the E2E-ST task. LST consists of a speech
frontend, an adapter, and an LLM backend. The training of LST consists of two
stages: (1) Modality adjustment, where the adapter is tuned to align speech
representation with text embedding space, and (2) Downstream task fine-tuning,
where both the adapter and the LLM are trained to optimize performance on the
E2E-ST task. Experimental results on the MuST-C speech translation benchmark
demonstrate that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on
En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a
new state-of-the-art. Additionally, we conduct an in-depth analysis of
single-modal model selection and the impact of training strategies, which lays
the foundation for future research. We will release our code and models after
review.
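The two-stage recipe above (first tune only the adapter to align modalities, then tune the adapter together with the LLM on E2E-ST) maps onto a simple freeze/unfreeze training pattern. The following is a minimal PyTorch-style sketch of that pattern; the Adapter module, projection sizes, and the train_stage helper are illustrative assumptions, not the authors' released implementation.

```python
# Minimal PyTorch-style sketch of the two-stage training recipe described in the abstract.
# Module names, dimensions, and the training helper are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Projects speech-frontend features into the LLM's text embedding space."""

    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, time, speech_dim) -> (batch, time, llm_dim)
        return self.proj(speech_feats)


def train_stage(tuned_modules, all_modules, loss_fn, batches, lr=1e-4):
    """Freeze every module, unfreeze only `tuned_modules`, then run one training pass."""
    for module in all_modules:
        module.requires_grad_(False)
    params = []
    for module in tuned_modules:
        module.requires_grad_(True)
        params.extend(module.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    for batch in batches:
        optimizer.zero_grad()
        loss = loss_fn(batch)  # e.g. cross-entropy over target translation tokens
        loss.backward()
        optimizer.step()


# Stage 1 -- modality adjustment: tune only the adapter so speech representations
# align with the LLM's text embedding space.
#   train_stage([adapter], [frontend, adapter, llm], st_loss, stage1_batches)
# Stage 2 -- downstream fine-tuning: tune the adapter and the LLM on the E2E-ST task.
#   train_stage([adapter, llm], [frontend, adapter, llm], st_loss, stage2_batches)
```

In this sketch only the adapter receives gradients in stage 1; in stage 2 the LLM is unfrozen as well, while the speech frontend stays frozen throughout, matching the components the abstract describes as being tuned.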
Related papers
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models [38.60622303744585]
LLaST is a framework for building high-performance large language model (LLM)-based speech-to-text translation systems.
Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization.
arXiv Detail & Related papers (2024-07-22T06:42:00Z)
- Investigating Decoder-only Large Language Models for Speech-to-text Translation [39.17113782374464]
Large language models (LLMs) are known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains.
We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation.
Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
arXiv Detail & Related papers (2024-07-03T14:42:49Z)
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained self-supervised learning (SSL) and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- Adapting Large Language Models for Document-Level Machine Translation [46.370862171452444]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks.
Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning.
This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs.
arXiv Detail & Related papers (2024-01-12T09:29:13Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM).
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM.
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding [23.367329217151084]
We introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding tasks.
Taking phoneme posterior and subword-level text as an input, ST-BERT learns a contextualized cross-modal alignment.
Our method shows further SLU performance gain via domain-adaptive pre-training with domain-specific speech-text pair data.
arXiv Detail & Related papers (2020-10-23T10:28:20Z)