Dealing with training and test segmentation mismatch: FBK@IWSLT2021
- URL: http://arxiv.org/abs/2106.12607v1
- Date: Wed, 23 Jun 2021 18:11:32 GMT
- Title: Dealing with training and test segmentation mismatch: FBK@IWSLT2021
- Authors: Sara Papi, Marco Gaido, Matteo Negri, Marco Turchi
- Abstract summary: This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task.
It is a Transformer-based architecture trained to translate English speech audio into German text.
The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure.
- Score: 13.89298686257514
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper describes FBK's system submission to the IWSLT 2021 Offline Speech
Translation task. We participated with a direct model: a Transformer-based
architecture trained to translate English speech audio into German text. The
training pipeline is characterized by knowledge
distillation and a two-step fine-tuning procedure. Both knowledge distillation
and the first fine-tuning step are carried out on manually segmented real and
synthetic data, the latter being generated with an MT system trained on the
available corpora. In contrast, the second fine-tuning step is carried out on a
random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce
the performance drops occurring when a speech translation model trained on
manually segmented data (i.e. an ideal, sentence-like segmentation) is
evaluated on automatically segmented audio (i.e. actual, more realistic testing
conditions). For the same purpose, a custom hybrid segmentation procedure that
accounts both for audio content (pauses) and for the length of the produced
segments is applied to the test data before passing them to the system. At
inference time, we compared this procedure with a baseline segmentation method
based on Voice Activity Detection (VAD). Our results indicate the effectiveness
of the proposed hybrid approach, shown by a reduction of the gap with manual
segmentation from 8.3 to 1.4 BLEU points.
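The knowledge-distillation step lends itself to a short illustration. A minimal word-level distillation loss, in which the speech translation student matches the soft token distributions of an MT teacher, could look like the sketch below (PyTorch; the temperature and reduction are illustrative assumptions, not FBK's exact recipe):

```python
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher (MT) and student (ST) distributions.

    student_logits, teacher_logits: (batch, seq_len, vocab) over the same
    target vocabulary. Generic word-level KD; temperature and weighting
    are illustrative choices, not taken from the paper.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # batchmean KL, rescaled by t^2 as is standard for distillation
    return F.kl_div(student_logp, teacher_probs,
                    reduction="batchmean") * (t * t)
```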
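The second fine-tuning step relies on randomly segmented training data. A toy generator of random cut points might look like the following (the length bounds are invented for illustration, and re-aligning the target text to the new cuts is a required step this sketch omits):

```python
import random

def random_resegment(total_dur, min_len=3.0, max_len=25.0, seed=None):
    """Draw random cut points over an audio file of `total_dur` seconds.

    Sketch of the idea behind fine-tuning on randomly segmented data:
    segment lengths are drawn uniformly between min_len and max_len
    (illustrative bounds, not the paper's actual values). The target
    text must then be re-aligned to the new cuts, which is not shown.
    """
    rng = random.Random(seed)
    cuts, t = [], 0.0
    while t < total_dur:
        end = min(t + rng.uniform(min_len, max_len), total_dur)
        cuts.append((t, end))
        t = end
    return cuts  # list of (start_sec, end_sec) pairs
```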
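A hybrid test-time segmenter of the kind the abstract describes must trade pause evidence against segment length. The sketch below is one plausible realization, not FBK's actual procedure: the energy-based pause detector, the thresholds, and the greedy policy are all assumptions.

```python
import numpy as np

def hybrid_segment(frame_energy, frame_dur=0.01,
                   min_len=5.0, max_len=20.0, pause_thresh=0.02):
    """Split an audio stream into segments using pauses *and* length limits.

    A frame is a pause candidate when its energy falls below `pause_thresh`;
    segments are cut at the lowest-energy pause found between `min_len` and
    `max_len` seconds, or forcibly at `max_len` when no pause is available.
    """
    boundaries, start = [], 0
    n = len(frame_energy)
    min_f, max_f = int(min_len / frame_dur), int(max_len / frame_dur)
    while start < n:
        end = min(start + max_f, n)
        window = frame_energy[start + min_f:end]
        if len(window) == 0:  # remaining tail is shorter than min_len
            boundaries.append((start, n))
            break
        pauses = np.where(window < pause_thresh)[0]
        # cut at the quietest pause frame if any, else at the max length
        cut = start + min_f + (pauses[np.argmin(window[pauses])]
                               if len(pauses) else len(window))
        boundaries.append((start, cut))
        start = cut
    return boundaries  # list of (start_frame, end_frame) pairs
```

In spirit this mirrors the abstract's claim: honoring pauses keeps segments sentence-like, while the length bounds keep them within the range the model saw during fine-tuning.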
Related papers
- Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of the Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly through Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces.
Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that injecting the generated code-switching text significantly boosts performance.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance through subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose the Mix at Three Levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, the training corpus is mixed at three levels (word, sentence, and frame) and the entire model is fine-tuned on the mixed data; a frame-level mixing sketch follows this list.
Experiments and analyses on the MuST-C speech translation benchmark show that M3ST outperforms strong baselines and achieves state-of-the-art results in eight directions, with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audio, such as TED talks, which must be split into shorter segments.
We propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
- Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation [14.151063458445826]
We show that our methods outperform all the other techniques, reducing the gap between the traditional VAD-based approach and optimal manual segmentation by at least 30%.
arXiv Detail & Related papers (2021-04-23T16:54:13Z)
- Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming both a strong base model and the fine-tuned model on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
- End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020 [20.456325305495966]
This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task.
The task evaluates systems' ability to translate the audio of English TED talks into German text.
Our system is an end-to-end model based on an adaptation of the Transformer for speech data.
arXiv Detail & Related papers (2020-06-04T15:47:47Z)
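As an illustration of the frame-level mixing named in the M3ST summary above, a mixup-style interpolation of two speech feature sequences is sketched below (generic mixup with a Beta-distributed weight; M3ST's actual mixing scheme and hyper-parameters may differ):

```python
import torch

def frame_level_mix(feats_a, feats_b, alpha=0.2):
    """Mixup-style interpolation of two speech feature tensors.

    feats_a, feats_b: (time, dim) filterbank features of equal length
    (pad or crop beforehand). The Beta-distributed mixing weight is the
    usual mixup choice; M3ST's exact scheme is not reproduced here.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * feats_a + (1.0 - lam) * feats_b
```

Word-level and sentence-level mixing would operate analogously on the transcript side, e.g. by swapping aligned words or concatenating sentence pairs.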