End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
- URL: http://arxiv.org/abs/2006.02965v1
- Date: Thu, 4 Jun 2020 15:47:47 GMT
- Title: End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
- Authors: Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Marco Turchi
- Abstract summary: This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task.
The task evaluates systems' ability to translate English TED talks audio into German texts.
Our system is an end-to-end model based on an adaptation of the Transformer for speech data.
- Score: 20.456325305495966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes FBK's participation in the IWSLT 2020 offline speech
translation (ST) task. The task evaluates systems' ability to translate English
TED talks audio into German texts. The test talks are provided in two versions:
one contains the data already segmented with automatic tools and the other is
the raw data without any segmentation. Participants can decide whether to work
on custom segmentation or not. We used the provided segmentation. Our system is
an end-to-end model based on an adaptation of the Transformer for speech data.
Its training process is the main focus of this paper and it is based on: i)
transfer learning (ASR pretraining and knowledge distillation), ii) data
augmentation (SpecAugment, time stretch and synthetic data), iii) combining
synthetic and real data marked as different domains, and iv) multi-task
learning using the CTC loss. Finally, after the training with word-level
knowledge distillation is complete, our ST models are fine-tuned using label
smoothed cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test
set, which is an excellent result compared to recent papers, and 23.7 BLEU on
the same data segmented with VAD, showing the need for researching solutions
addressing this specific data condition.
Related papers
- Improving End-to-End Speech Processing by Efficient Text Data
Utilization with Latent Synthesis [17.604583337593677]
Training a high performance end-to-end speech (E2E) processing model requires an enormous amount of labeled speech data.
We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models.
arXiv Detail & Related papers (2023-10-09T03:10:49Z) - M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose Mix at three levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data.
Experiments on MuST-C speech translation benchmark and analysis show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z) - BJTU-WeChat's Systems for the WMT22 Chat Translation Task [66.81525961469494]
This paper introduces the joint submission of the Beijing Jiaotong University and WeChat AI to the WMT'22 chat translation task for English-German.
Based on the Transformer, we apply several effective variants.
Our systems achieve 0.810 and 0.946 COMET scores.
arXiv Detail & Related papers (2022-11-28T02:35:04Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Dealing with training and test segmentation mismatch: FBK@IWSLT2021 [13.89298686257514]
This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task.
It is a Transformer-based architecture trained to translate English speech audio data into German texts.
The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure.
arXiv Detail & Related papers (2021-06-23T18:11:32Z) - Exploring Transfer Learning For End-to-End Spoken Language Understanding [8.317084844841323]
An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option.
We propose an E2E system that is designed to jointly train on multiple speech-to-text tasks.
We show that it beats the performance of E2E models trained on individual tasks.
arXiv Detail & Related papers (2020-12-15T19:02:15Z) - Phonemer at WNUT-2020 Task 2: Sequence Classification Using COVID
Twitter BERT and Bagging Ensemble Technique based on Plurality Voting [0.0]
We develop a system that automatically identifies whether an English Tweet related to the novel coronavirus (COVID-19) is informative or not.
Our final approach achieved an F1-score of 0.9037 and we were ranked sixth overall with F1-score as the evaluation criteria.
arXiv Detail & Related papers (2020-10-01T10:54:54Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.