SegAugment: Maximizing the Utility of Speech Translation Data with
Segmentation-based Augmentations
- URL: http://arxiv.org/abs/2212.09699v3
- Date: Wed, 1 Nov 2023 14:18:40 GMT
- Title: SegAugment: Maximizing the Utility of Speech Translation Data with
Segmentation-based Augmentations
- Authors: Ioannis Tsiamas, José A. R. Fonollosa, Marta R. Costa-jussà
- Abstract summary: End-to-end Speech Translation is hindered by a lack of available data resources.
We propose a new data augmentation strategy, SegAugment, to address this issue.
We show that the proposed method can also successfully augment sentence-level datasets.
- Score: 2.535399238341164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end Speech Translation is hindered by a lack of available data
resources. While most of these resources are document-based, only a single, static
sentence-level segmentation is available for each, which potentially limits the
usefulness of the data. We propose a new data augmentation strategy,
SegAugment, to address this issue by generating multiple alternative
sentence-level versions of a dataset. Our method utilizes an Audio Segmentation
system, which re-segments the speech of each document with different length
constraints, after which we obtain the target text via alignment methods.
Experiments demonstrate consistent gains across eight language pairs in MuST-C,
with an average increase of 2.5 BLEU points, and up to 5 BLEU for low-resource
scenarios in mTEDx. Furthermore, when combined with a strong system, SegAugment
establishes new state-of-the-art results in MuST-C. Finally, we show that the
proposed method can also successfully augment sentence-level datasets, and that
it enables Speech Translation models to close the gap between the manual and
automatic segmentation at inference time.
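The pipeline the abstract describes can be pictured in a few lines. The sketch below is only a minimal illustration of the idea, assuming a learned segmenter and an alignment step; all names and the two helper stubs are hypothetical stand-ins, not the authors' implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Document:
    audio_len: float        # total duration of the document audio, in seconds
    translation: str        # document-level target text

def segment_audio(duration, min_len, max_len):
    """Stand-in for a learned segmenter: chop the document into spans
    whose lengths respect the given constraint."""
    spans, t = [], 0.0
    while t < duration:
        step = min(random.uniform(min_len, max_len), duration - t)
        spans.append((t, t + step))
        t += step
    return spans

def align_translation(doc, start, end):
    """Stand-in for the alignment methods that recover the target text
    covered by an audio span."""
    return f"<target text for {start:.1f}-{end:.1f}s>"

def seg_augment(docs, constraints=((0.4, 10), (10, 20), (20, 30))):
    """One alternative sentence-level dataset per length constraint."""
    versions = []
    for min_len, max_len in constraints:
        version = [(start, end, align_translation(doc, start, end))
                   for doc in docs
                   for start, end in segment_audio(doc.audio_len, min_len, max_len)]
        versions.append(version)
    return versions

docs = [Document(audio_len=600.0, translation="...")]
print(len(seg_augment(docs)))   # 3 alternative versions of the corpus
```

Training data would then be the original sentence-level set plus all synthetic versions.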
Related papers
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
- Long-Form End-to-End Speech Translation via Latent Alignment Segmentation [6.153530338207679]
Current simultaneous speech translation models can process audio only a few seconds long.
We propose a novel segmentation approach for low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
arXiv Detail & Related papers (2023-09-20T15:10:12Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating source-language speech into target-language text without generating intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
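A toy sketch of the shared-vocabulary idea, assuming a vector-quantization bottleneck with a straight-through estimator; the class and all details are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class SharedCodebookQuantizer(nn.Module):
    """Both the speech encoder and the text encoder are quantized against
    one shared discrete codebook, so both modalities land in the same
    discrete semantic space."""
    def __init__(self, num_codes=8192, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, hidden):                    # hidden: (batch, time, dim)
        # Nearest-neighbor lookup into the shared codebook.
        dists = torch.cdist(hidden, self.codebook.weight)  # (batch, time, num_codes)
        codes = dists.argmin(dim=-1)                       # shared discrete units
        quantized = self.codebook(codes)
        # Straight-through estimator: gradients flow back to the encoder.
        quantized = hidden + (quantized - hidden).detach()
        return quantized, codes

# Toy usage: speech and text encoder outputs share one codebook.
vq = SharedCodebookQuantizer()
speech_q, speech_codes = vq(torch.randn(2, 50, 512))
text_q, text_codes = vq(torch.randn(2, 12, 512))
```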
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audio, such as TED talks, which must be split into shorter segments.
We propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
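A rough sketch of how such a learned segmenter can be applied at inference, assuming a trained frame classifier has already produced per-frame speech probabilities; the splitting loop below follows the divide-and-conquer spirit of the method, with details simplified.

```python
import numpy as np

def split_segments(probs, max_frames):
    """Iteratively split [0, len) at the least-probable frame until every
    piece respects the maximum segment length."""
    segments, stack = [], [(0, len(probs))]
    while stack:
        start, end = stack.pop()
        if end - start <= max_frames:
            segments.append((start, end))
            continue
        # Cut where the classifier is least confident that speech continues,
        # guarding against zero-length pieces.
        cut = start + int(np.argmin(probs[start:end]))
        cut = min(max(cut, start + 1), end - 1)
        stack.extend([(start, cut), (cut, end)])
    return sorted(segments)

probs = np.random.rand(12_000)          # stand-in for classifier outputs
print(split_segments(probs, max_frames=1_000)[:5])
```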
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
- Systematic Investigation of Strategies Tailored for Low-Resource Settings for Sanskrit Dependency Parsing [14.416855042499945]
Existing state-of-the-art approaches for Sanskrit Dependency Parsing (SDP) are hybrid in nature.
Purely data-driven approaches do not match the performance of hybrid approaches due to labelled data sparsity.
We experiment with five strategies, namely, data augmentation, sequential transfer learning, cross-lingual/mono-lingual pretraining, multi-task learning and self-training.
Our proposed ensembled system outperforms the purely data-driven state-of-the-art system by an absolute gain of 2.8/3.9 points in Unlabelled/Labelled Attachment Score (UAS/LAS).
arXiv Detail & Related papers (2022-01-27T08:24:53Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Extracting and filtering paraphrases by bridging natural language inference and paraphrasing [0.0]
We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets.
The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.
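The bridging criterion can be read as bidirectional entailment; the sketch below encodes that assumed reading, with a dummy stand-in for the NLI model (the method's actual details may differ).

```python
def nli_entailment(premise: str, hypothesis: str) -> float:
    """Dummy stand-in for a real NLI model's entailment probability."""
    return 0.95 if premise.lower() == hypothesis.lower() else 0.1

def is_paraphrase(a: str, b: str, threshold: float = 0.9) -> bool:
    # Paraphrase ~ entailment holding in both directions.
    return (nli_entailment(a, b) >= threshold
            and nli_entailment(b, a) >= threshold)

print(is_paraphrase("The cat sat.", "the cat sat."))   # True
```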
arXiv Detail & Related papers (2021-11-13T14:06:37Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
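As a concrete picture of the clustering half only, here is a minimal sketch: window-level speaker embeddings (x-vector-like) are grouped by agglomerative clustering, one common choice; the EEND half and its integration are omitted, and nothing here is the authors' code.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(embeddings: np.ndarray, distance_threshold: float = 1.0):
    """Assign a speaker label to each embedding window; the number of
    speakers is not fixed in advance but follows from the threshold."""
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",          # `affinity` in scikit-learn < 1.2
        linkage="average",
    )
    return clustering.fit_predict(embeddings)

windows = np.random.randn(200, 512)   # stand-in x-vectors, one per window
labels = cluster_speakers(windows)    # per-window speaker labels
```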
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
- End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020 [20.456325305495966]
This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task.
The task evaluates systems' ability to translate English TED talks audio into German texts.
Our system is an end-to-end model based on an adaptation of the Transformer for speech data.
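For context, word-level knowledge distillation of this kind is usually implemented as a KL term between the ST student's output distribution and an MT teacher's; the snippet below is a generic formulation of that loss, not necessarily FBK's exact recipe.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            T: float = 1.0) -> torch.Tensor:
    """KL divergence between the student's and the (frozen) teacher's
    softened output distributions, scaled by T^2 as is customary."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Toy usage: 8 target positions over a 32k-subword vocabulary.
loss = kd_loss(torch.randn(8, 32_000), torch.randn(8, 32_000))
```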
arXiv Detail & Related papers (2020-06-04T15:47:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.