Phone Features Improve Speech Translation
- URL: http://arxiv.org/abs/2005.13681v1
- Date: Wed, 27 May 2020 22:05:10 GMT
- Title: Phone Features Improve Speech Translation
- Authors: Elizabeth Salesky and Alan W Black
- Abstract summary: End-to-end models for speech translation (ST) more tightly couple speech recognition (ASR) and machine translation (MT) than a traditional cascade of separate ASR and MT models.
We compare cascaded and end-to-end models across high, medium, and low-resource conditions, and show that cascades remain stronger baselines.
We introduce two methods to incorporate phone features into ST models, and show that these features improve both architectures, closing the gap between end-to-end models and cascades, and outperforming previous academic work -- by up to 9 BLEU on our low-resource setting.
- Score: 69.54616570679343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end models for speech translation (ST) more tightly couple speech
recognition (ASR) and machine translation (MT) than a traditional cascade of
separate ASR and MT models, with simpler model architectures and the potential
for reduced error propagation. Their performance is often assumed to be
superior, though in many conditions this is not yet the case. We compare
cascaded and end-to-end models across high, medium, and low-resource
conditions, and show that cascades remain stronger baselines. Further, we
introduce two methods to incorporate phone features into ST models. We show
that these features improve both architectures, closing the gap between
end-to-end models and cascades, and outperforming previous academic work -- by
up to 9 BLEU on our low-resource setting.
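As a rough illustration of the cascade structure the abstract contrasts with end-to-end models, the sketch below composes a toy ASR step and a toy MT step. Everything here is a hypothetical stand-in (`cascade`, `toy_asr`, `toy_mt`, and `TOY_LEXICON` are not from the paper, and real systems are trained neural models); it only shows why a transcription error propagates into the translation when the MT step sees nothing but the transcript.

```python
from typing import Callable, Dict, List

Audio = List[float]  # assumption: audio as a plain list of samples


def cascade(asr: Callable[[Audio], str],
            mt: Callable[[str], str]) -> Callable[[Audio], str]:
    """Compose ASR and MT so that MT only ever sees the ASR transcript."""
    def translate(audio: Audio) -> str:
        transcript = asr(audio)  # discrete intermediate output
        return mt(transcript)    # any ASR error is passed straight to MT
    return translate


def toy_asr(audio: Audio) -> str:
    # Hypothetical recognizer: "mishears" the second word on non-positive input.
    return "hola mundo" if sum(audio) > 0 else "hola mudo"


# Hypothetical word-for-word lexicon standing in for an MT model.
TOY_LEXICON: Dict[str, str] = {"hola": "hello", "mundo": "world", "mudo": "mute"}


def toy_mt(text: str) -> str:
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


speech_translate = cascade(toy_asr, toy_mt)
clean = speech_translate([0.1, 0.2])  # correct transcript, correct translation
noisy = speech_translate([-0.1])      # ASR error carried into the translation
```

An end-to-end model would replace `cascade(toy_asr, toy_mt)` with a single function from audio to translated text, which removes the discrete transcript bottleneck but, per the abstract, does not yet guarantee better results.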
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - Coupling Speech Encoders with Downstream Text Models [4.679869237248675]
We present a modular approach to building cascade speech translation models.
We preserve state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task.
arXiv Detail & Related papers (2024-07-24T19:29:13Z) - ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z) - Mixture of Tokens: Continuous MoE through Cross-Example Aggregation [0.7880651741080428]
Mixture of Experts (MoE) models are pushing the boundaries of language and vision tasks.
MoT is a simple, continuous architecture that is capable of scaling the number of parameters similarly to sparse MoE models.
Our best models achieve a 3x increase in training speed over dense Transformer models in language pretraining.
arXiv Detail & Related papers (2023-10-24T16:03:57Z) - ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks [8.651248939672769]
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation.
We build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR.
Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models.
arXiv Detail & Related papers (2022-05-04T10:36:57Z) - Knowledge Distillation for Quality Estimation [79.51452598302934]
Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations.
Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results.
We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations with 8x fewer parameters.
arXiv Detail & Related papers (2021-07-01T12:36:21Z) - Streaming Models for Joint Speech Recognition and Translation [11.657994715914748]
We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches.
We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders.
arXiv Detail & Related papers (2021-01-22T15:16:54Z) - Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
arXiv Detail & Related papers (2020-11-24T15:43:49Z) - Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z) - A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)