Multilingual Speech Translation with Unified Transformer: Huawei Noah's
Ark Lab at IWSLT 2021
- URL: http://arxiv.org/abs/2106.00197v1
- Date: Tue, 1 Jun 2021 02:50:49 GMT
- Title: Multilingual Speech Translation with Unified Transformer: Huawei Noah's
Ark Lab at IWSLT 2021
- Authors: Xingshan Zeng, Liangyou Li and Qun Liu
- Abstract summary: This paper describes the system submitted to the IWSLT 2021 Multilingual Speech Translation (MultiST) task from Huawei Noah's Ark Lab.
We use a unified transformer architecture for our MultiST model, so that the data from different modalities can be exploited to enhance the model's ability.
We apply several training techniques to improve the performance, including multi-task learning, task-level curriculum learning, data augmentation, etc.
- Score: 33.876412404781846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the system submitted to the IWSLT 2021 Multilingual
Speech Translation (MultiST) task from Huawei Noah's Ark Lab. We use a unified
transformer architecture for our MultiST model, so that the data from different
modalities (i.e., speech and text) and different tasks (i.e., Speech
Recognition, Machine Translation, and Speech Translation) can be exploited to
enhance the model's ability. Specifically, speech and text inputs are first
fed to different feature extractors to extract acoustic and textual features,
respectively. Then, these features are processed by a shared encoder-decoder
architecture. We apply several training techniques to improve the performance,
including multi-task learning, task-level curriculum learning, data
augmentation, etc. Our final system achieves significantly better results than
bilingual baselines on supervised language pairs and yields reasonable results
on zero-shot language pairs.
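The data flow described in the abstract (modality-specific feature extractors feeding one shared encoder-decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SpeechInput`, `TextInput`, `acoustic_features`, `textual_features`, `shared_encoder_decoder`, and the width `D_MODEL` are all hypothetical, and the "extractors" are toy arithmetic stand-ins for real neural modules. The only point it shows is that both modalities are mapped into the same feature width so that a single shared component can serve ASR, MT, and ST.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

D_MODEL = 4  # hypothetical shared feature width (assumption, not from the paper)


@dataclass
class SpeechInput:
    frames: List[List[float]]  # toy acoustic frames, e.g. filterbank vectors


@dataclass
class TextInput:
    token_ids: List[int]  # toy token ids


def _fit(vec: List[float], width: int = D_MODEL) -> List[float]:
    """Trim or zero-pad a vector to the shared feature width."""
    return (list(vec) + [0.0] * width)[:width]


def acoustic_features(x: SpeechInput) -> List[List[float]]:
    """Toy acoustic extractor: average each pair of adjacent frames.

    This halves the sequence length; a trailing unpaired frame is dropped.
    """
    out = []
    for i in range(0, len(x.frames) - 1, 2):
        a, b = x.frames[i], x.frames[i + 1]
        out.append(_fit([(p + q) / 2 for p, q in zip(a, b)]))
    return out


def textual_features(x: TextInput) -> List[List[float]]:
    """Toy textual extractor: a deterministic pseudo-embedding per token id."""
    return [_fit([float(t % 7), float(t % 3)]) for t in x.token_ids]


def shared_encoder_decoder(features: List[List[float]], task: str) -> Dict:
    """Stand-in for the shared encoder-decoder: mean-pool the features."""
    pooled = [sum(col) / len(features) for col in zip(*features)]
    return {"task": task, "pooled": pooled}


def run(x: Union[SpeechInput, TextInput], task: str) -> Dict:
    """Route the input to its modality-specific extractor, then share the rest."""
    if isinstance(x, SpeechInput):
        feats = acoustic_features(x)  # speech input: ASR or ST
    elif isinstance(x, TextInput):
        feats = textual_features(x)  # text input: MT
    else:
        raise TypeError("unsupported input modality")
    return shared_encoder_decoder(feats, task)


st = run(SpeechInput(frames=[[1.0, 2.0], [3.0, 4.0]]), task="ST")
mt = run(TextInput(token_ids=[8, 9]), task="MT")
```

Because both extractors emit vectors of width `D_MODEL`, the same downstream function handles either modality, which is what allows ASR, MT, and ST data to be mixed in one model in a setup of this kind.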
Related papers
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce Generative Pre-trained Speech Transformer (GPST).
GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture.
Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
We were declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation [39.74625363642717]
We represent multilingual speech audio with speech units, the quantized representations of speech features encoded from a self-supervised speech model.
Then, we propose to train an encoder-decoder structured model with a Unit-to-Unit Translation (UTUT) objective on multilingual data.
A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task [36.51221186190272]
We describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign.
Our system is built by leveraging transfer learning across modalities, tasks and languages.
arXiv Detail & Related papers (2021-07-14T19:43:44Z)
- ESPnet-ST IWSLT 2021 Offline Speech Translation System [56.83606198051871]
This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track.
This year we made various efforts on training data, architecture, and audio segmentation.
Our best E2E system combined all the techniques with model ensembling and achieved 31.4 BLEU.
arXiv Detail & Related papers (2021-07-01T17:49:43Z)
- Efficient Weight factorization for Multilingual Speech Recognition [67.00151881207792]
End-to-end multilingual speech recognition involves training a single model on a compositional speech corpus that includes many languages.
Because each language in the training data has different characteristics, the shared network may struggle to optimize for all languages simultaneously.
We propose a novel multilingual architecture that targets the core operation in neural networks: linear transformation functions.
arXiv Detail & Related papers (2021-05-07T00:12:02Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z)