Revisiting End-to-End Speech-to-Text Translation From Scratch
- URL: http://arxiv.org/abs/2206.04571v1
- Date: Thu, 9 Jun 2022 15:39:19 GMT
- Title: Revisiting End-to-End Speech-to-Text Translation From Scratch
- Authors: Biao Zhang, Barry Haddow, Rico Sennrich
- Abstract summary: End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining
its encoder and/or decoder using source transcripts via speech recognition or
text translation tasks, without which translation performance drops
substantially. However, transcripts are not always available, and how
significant such pretraining is for E2E ST has rarely been studied in the
literature. In this paper, we revisit this question and explore the extent to
which the quality of E2E ST trained on speech-translation pairs alone can be
improved. We reexamine several techniques previously shown to benefit ST, and
offer a set of best practices that biases a Transformer-based E2E ST system
toward training from scratch. In addition, we propose a parameterized distance
penalty (sketched below) to facilitate the modeling of locality in the
self-attention model for speech. On four benchmarks covering 23 languages, our experiments show that,
without using any transcripts or pretraining, the proposed system reaches and
even outperforms previous studies adopting pretraining, although the gap
remains in (extremely) low-resource settings. Finally, we discuss neural
acoustic feature modeling, where a neural model extracts acoustic features
directly from raw speech signals, with the goal of simplifying inductive
biases and giving the model more freedom in describing speech. For the first time,
we demonstrate its feasibility and show encouraging results on ST tasks.
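The abstract does not give the penalty's exact form, so here is a minimal sketch of one plausible formulation, assuming a learnable scalar that scales a logarithmic distance penalty subtracted from the self-attention logits; the module name `PDPSelfAttention` and the parameter `w` are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDPSelfAttention(nn.Module):
    """Single-head self-attention with a parameterized distance penalty.

    A learnable scalar `w` scales a log-distance penalty subtracted from
    the attention logits, biasing each position toward its neighbors.
    This is an illustrative reconstruction, not the paper's exact code.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.w = nn.Parameter(torch.ones(1))  # learned penalty strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        t, d = x.size(1), x.size(2)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, t, t)
        # |i - j| distance matrix, penalized on a log scale
        pos = torch.arange(t, device=x.device)
        dist = (pos[:, None] - pos[None, :]).abs().float()
        logits = logits - self.w.abs() * torch.log1p(dist)
        return F.softmax(logits, dim=-1) @ v
```

If `w` were fixed rather than learned, this would reduce to a hard-coded logarithmic distance penalty; making it a parameter lets the model learn how local attention over speech frames should be.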
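The neural acoustic feature modeling discussed above replaces hand-crafted filterbank features with features learned from the raw waveform. A hedged sketch of such a front-end follows, assuming a small stack of strided 1-D convolutions; the depth, kernel sizes, and strides are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RawSpeechFrontend(nn.Module):
    """Maps a raw waveform to frame-level acoustic features.

    A sketch of a convolutional feature extractor replacing log-Mel
    filterbanks; layer count, kernels, and strides are assumptions.
    """
    def __init__(self, dim: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> features: (batch, frames, dim)
        return self.layers(wav.unsqueeze(1)).transpose(1, 2)
```

The product of the strides (5 * 4 * 2 = 40 samples here) plays the role of the frame shift used in conventional filterbank extraction.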
Related papers
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models outperform baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST), which translates speech from one language into another.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on-par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to building speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices; a decoding sketch follows this entry.
TranSpeech significantly improves inference latency, achieving speedups of up to 21.4x over the autoregressive baseline.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
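The blurb names mask-predict style decoding without details; below is a minimal, hedged sketch of the generic iterative mask-predict loop, where `model` is a stand-in callable returning per-position logits over discrete units. The function and its signature are illustrative assumptions, not TranSpeech's API.

```python
import torch

def mask_predict(model, length: int, mask_id: int, iters: int = 4) -> torch.Tensor:
    """Generic iterative mask-predict decoding over discrete units.

    `model(tokens)` is assumed to return (length, vocab) logits; this
    is a placeholder interface, not TranSpeech's actual one.
    """
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    confidence = torch.zeros(length)
    for it in range(iters):
        logits = model(tokens)                           # (length, vocab)
        probs, preds = logits.softmax(dim=-1).max(dim=-1)
        masked = tokens == mask_id
        tokens = torch.where(masked, preds, tokens)      # fill masked slots
        confidence = torch.where(masked, probs, confidence)
        n_mask = int(length * (iters - 1 - it) / iters)  # shrinking schedule
        if n_mask == 0:
            break
        # re-mask the least confident positions for the next pass
        remask = confidence.topk(n_mask, largest=False).indices
        tokens[remask] = mask_id
        confidence[remask] = 0.0
    return tokens
```

All positions are predicted in parallel on each pass, which is where the latency win over token-by-token autoregressive decoding comes from.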
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation and formulate a self-supervised pseudo speech recognition task; unit induction is sketched after this entry.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
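A hedged sketch of one way to induce such pseudo-language units: cluster frame-level features with k-means and collapse repeated unit IDs. The feature type and cluster count here are illustrative assumptions; the actual Wav2Seq recipe additionally compresses units (e.g., with BPE) into pseudo subwords.

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def induce_units(features: np.ndarray, n_units: int = 100) -> list[int]:
    """Turn frame-level features (frames, dim) into a pseudo-unit sequence.

    k-means assigns a unit ID per frame; collapsing runs of identical
    IDs yields a compact discrete 'transcript' usable as a pseudo-ASR
    target. Feature choice and n_units are not Wav2Seq's exact setup.
    """
    km = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)
    frame_units = km.predict(features)
    # collapse runs of identical units: [5, 5, 7, 7, 7, 2] -> [5, 7, 2]
    return [int(u) for u, _ in groupby(frame_units)]
```

Training a standard encoder-decoder to "transcribe" speech into these units then gives the self-supervised pseudo speech recognition task described above.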
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder trained with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech; a joint training step is sketched below.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
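As a hedged illustration of speech-text joint pre-training, the step below sums a masked-LM loss on a text batch with a w2v-BERT-style loss on a speech batch through one shared encoder; `mlm_loss` and `w2v_bert_loss` are hypothetical placeholder callables, not SLAM's actual functions, and a real system would add modality-specific front-ends.

```python
import torch
import torch.nn as nn

def joint_pretrain_step(encoder: nn.Module,
                        text_batch: torch.Tensor,
                        speech_batch: torch.Tensor,
                        mlm_loss, w2v_bert_loss,
                        optimizer: torch.optim.Optimizer) -> float:
    """One joint step: both objectives update the same shared encoder.

    `mlm_loss` and `w2v_bert_loss` are placeholders mapping encoder
    outputs to scalar losses; they stand in for the BERT and w2v-BERT
    objectives and are not SLAM's actual API.
    """
    optimizer.zero_grad()
    loss = mlm_loss(encoder(text_batch)) + w2v_bert_loss(encoder(speech_batch))
    loss.backward()   # gradients from both modalities flow into shared weights
    optimizer.step()
    return loss.item()
```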