Efficient Speech Translation with Dynamic Latent Perceivers
- URL: http://arxiv.org/abs/2210.16264v1
- Date: Fri, 28 Oct 2022 16:52:48 GMT
- Title: Efficient Speech Translation with Dynamic Latent Perceivers
- Authors: Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà
- Abstract summary: Transformers have been the dominant architecture for Speech Translation, achieving significant improvements in translation quality.
We propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation.
We also introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have been the dominant architecture for Speech Translation in
recent years, achieving significant improvements in translation quality. Since
speech signals are longer than their textual counterparts, and due to the
quadratic complexity of the Transformer, a down-sampling step is essential for
its adoption in Speech Translation. Instead, in this research, we propose to
ease the complexity by using a Perceiver encoder to map the speech inputs to a
fixed-length latent representation. Furthermore, we introduce a novel way of
training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent
spaces without any additional computational overhead. Speech-to-Text Perceivers
with DLA can match the performance of a Transformer baseline across three
language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to
DLA at inference, and can be flexibly deployed with various computational
budgets, without significant drops in translation quality.
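The abstract names two concrete mechanisms: a Perceiver encoder that cross-attends from a small set of learned latents to the long speech sequence, and Dynamic Latent Access (DLA), which trains with varying latent budgets drawn from a larger latent bank. Below is a minimal PyTorch sketch of both ideas; all module names, sizes, and the way the latent subset is sampled here are our own assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DLAPerceiverEncoder(nn.Module):
    """Illustrative Perceiver-style encoder with Dynamic Latent Access.

    A learned bank of `max_latents` vectors cross-attends to the speech
    features; during training only a randomly sized subset of the bank is
    used, so any latent budget up to `max_latents` works at inference.
    """

    def __init__(self, d_model=256, max_latents=512, n_heads=4, n_layers=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(max_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                       batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, speech_feats, num_latents=None):
        # speech_feats: (batch, time, d_model) -- the long acoustic sequence
        B, M = speech_feats.size(0), self.latents.size(0)
        if num_latents is None:  # DLA: draw a random budget at each training step
            num_latents = torch.randint(low=M // 8, high=M + 1, size=(1,)).item()
        idx = torch.randperm(M)[:num_latents]
        lat = self.latents[idx].unsqueeze(0).expand(B, -1, -1)
        # Cross-attention cost is O(num_latents * time), not O(time^2).
        lat, _ = self.cross_attn(query=lat, key=speech_feats, value=speech_feats)
        for blk in self.self_blocks:  # self-attention over the short latent sequence
            lat = blk(lat)
        return lat  # fixed-length representation handed to the decoder

# usage: a 1500-frame utterance compressed to 64 latent vectors at inference
enc = DLAPerceiverEncoder()
z = enc(torch.randn(2, 1500, 256), num_latents=64)
print(z.shape)  # torch.Size([2, 64, 256])
```

The key property is visible in the cross-attention call: cost scales with num_latents × time rather than time², and because training saw many budgets, the same model can be deployed with a smaller latent budget when compute is tight.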
Related papers
- GLoT: A Novel Gated-Logarithmic Transformer for Efficient Sign Language Translation
We propose a novel Gated-Logarithmic Transformer (GLoT) that captures the long-term temporal dependencies of sign language as time-series data.
Our results demonstrate that GLoT consistently outperforms the other models across all metrics.
arXiv Detail & Related papers (2025-02-17T14:31:00Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation
End-to-end sign language translation (SLT) aims to convert sign language videos directly into spoken language texts without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Non-Parametric Domain Adaptation for End-to-End Speech Translation
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z)
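The summary leaves the mechanism unspecified; a common non-parametric recipe for this setting is kNN-MT-style retrieval, sketched below under that assumption. A datastore of decoder hidden states and their target tokens is built from the domain-specific text corpus, and at each decoding step the retrieved neighbor distribution is interpolated with the model's own prediction. All names and the interpolation weight `lam` are illustrative.

```python
import torch
import torch.nn.functional as F

def knn_interpolate(hidden, logits, keys, values, k=8, temperature=10.0, lam=0.5):
    """kNN-MT-style interpolation (a generic sketch, not necessarily the
    paper's exact method).

    hidden : (d,)    decoder state at the current step
    logits : (V,)    the model's next-token logits
    keys   : (N, d)  datastore keys built from the domain text corpus
    values : (N,)    target token id stored with each key
    """
    dists = torch.cdist(hidden.unsqueeze(0), keys).squeeze(0)  # (N,)
    knn_d, knn_i = dists.topk(k, largest=False)                # k nearest neighbors
    w = F.softmax(-knn_d / temperature, dim=0)                 # neighbor weights
    p_knn = torch.zeros_like(logits)
    p_knn.scatter_add_(0, values[knn_i], w)                    # neighbor distribution
    p_model = F.softmax(logits, dim=0)
    return lam * p_knn + (1.0 - lam) * p_model                 # mixed distribution

# toy usage: random datastore of 1000 entries, vocabulary of 100
d, V, N = 64, 100, 1000
p = knn_interpolate(torch.randn(d), torch.randn(V),
                    torch.randn(N, d), torch.randint(0, V, (N,)))
print(p.sum())  # ~1.0
```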
- RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
arXiv Detail & Related papers (2021-06-09T06:35:46Z)
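Wait-K-Stride-N is the one concrete policy this summary names. On a plain reading (our interpretation; RealTranS's exact bookkeeping may differ), the decoder first reads k source chunks, then writes n target tokens for every additional chunk it reads:

```python
def wait_k_stride_n_schedule(num_src_chunks, k=3, n=2):
    """Schematic Wait-K-Stride-N read/write schedule for simultaneous
    speech translation (a simplified sketch of the named policy).

    First READ k source chunks, then alternate WRITE n target tokens /
    READ one more chunk until the source is exhausted, then WRITE the rest.
    """
    actions = ["READ"] * min(k, num_src_chunks)
    read = len(actions)
    while read < num_src_chunks:
        actions += ["WRITE"] * n
        actions.append("READ")
        read += 1
    actions.append("WRITE(rest)")
    return actions

print(wait_k_stride_n_schedule(6, k=3, n=2))
# ['READ', 'READ', 'READ', 'WRITE', 'WRITE', 'READ', 'WRITE', 'WRITE',
#  'READ', 'WRITE', 'WRITE', 'READ', 'WRITE(rest)']
```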
- Multilingual Speech Translation with Unified Transformer: Huawei Noah's Ark Lab at IWSLT 2021
This paper describes Huawei Noah's Ark Lab's submission to the IWSLT 2021 Multilingual Speech Translation (MultiST) task.
We use a unified transformer architecture for our MultiST model, so that data from different modalities can be exploited to enhance the model's capability.
We apply several training techniques to improve the performance, including multi-task learning, task-level curriculum learning, and data augmentation, among others.
arXiv Detail & Related papers (2021-06-01T02:50:49Z)
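A minimal sketch of the multi-task training mentioned above, with a toy shared model and hypothetical task weights; the submission's actual architectures, losses, and weights are not given in this summary.

```python
import torch
import torch.nn as nn

class ToySharedModel(nn.Module):
    """Stand-in for the unified transformer: one shared trunk, one head per task."""
    def __init__(self, d=32):
        super().__init__()
        self.shared = nn.Linear(d, d)  # plays the role of the shared encoder-decoder
        self.heads = nn.ModuleDict({t: nn.Linear(d, d) for t in ("asr", "mt", "st")})

    def loss(self, task, x, y):
        return nn.functional.mse_loss(self.heads[task](self.shared(x)), y)

model = ToySharedModel()
batch = {t: (torch.randn(8, 32), torch.randn(8, 32)) for t in ("asr", "mt", "st")}
weights = {"asr": 0.3, "mt": 0.3, "st": 1.0}  # hypothetical; ST is the primary task
total = sum(w * model.loss(t, *batch[t]) for t, w in weights.items())
total.backward()  # one multi-task gradient step over all three objectives
```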
- Learning Robust Latent Representations for Controllable Speech Synthesis
We propose RTI-VAE (Reordered Transformer with Information reduction VAE) to minimize the mutual information between different latent variables.
We show that RTI-VAE reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla Transformer-VAE.
arXiv Detail & Related papers (2021-05-10T15:49:03Z)
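Mutual information between latent variables cannot be computed exactly for a deep model. As a crude, clearly-labeled stand-in for the information-reduction idea (not RTI-VAE's actual estimator), one can penalize the batch cross-covariance between two latent groups, e.g. speaker and content:

```python
import torch

def cross_covariance_penalty(z_speaker, z_content):
    """Drive the batch cross-covariance between two latent groups to zero.
    This is only a rough proxy for reducing their mutual information; the
    estimator used by RTI-VAE is not described in the summary above.

    z_speaker: (batch, d1), z_content: (batch, d2)
    """
    zs = z_speaker - z_speaker.mean(dim=0, keepdim=True)
    zc = z_content - z_content.mean(dim=0, keepdim=True)
    cov = zs.t() @ zc / (zs.size(0) - 1)  # (d1, d2) cross-covariance matrix
    return cov.pow(2).sum()

# usage: add the penalty to the usual VAE objective
penalty = cross_covariance_penalty(torch.randn(16, 8), torch.randn(16, 12))
```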
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
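A sketch of the plug-in described above: an encoder layer whose tokens self-attend within their own language and then optionally cross-attend to the paired sequence in the other language, so masked-word prediction is no longer conditioned on one language alone. Layer sizes, ordering, and names are illustrative rather than VECO's exact configuration.

```python
import torch
import torch.nn as nn

class CrossAttentiveEncoderLayer(nn.Module):
    """Encoder layer with a plug-in cross-attention sub-layer between
    self-attention and the feed-forward block (illustrative sketch)."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, other=None):
        h, _ = self.self_attn(x, x, x)        # attend within the same language
        x = self.norm1(x + h)
        if other is not None:                 # the plug-in cross-lingual module
            h, _ = self.cross_attn(x, other, other)
            x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))

# usage: an English batch attending to its aligned French batch
layer = CrossAttentiveEncoderLayer()
en, fr = torch.randn(2, 10, 256), torch.randn(2, 12, 256)
out = layer(en, other=fr)  # (2, 10, 256)
```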
- Bridging the Modality Gap for Speech-to-Text Translation
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model, which aims to improve end-to-end performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Learning Source Phrase Representations for Neural Machine Translation
We propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z)
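One simple realization of "phrase representations from corresponding token representations" is learned attention pooling over each phrase span, sketched below; this is our assumption for illustration, and the paper's mechanism may differ in detail.

```python
import torch
import torch.nn as nn

class AttentivePhrasePooling(nn.Module):
    """Pool the token vectors inside each phrase span into one phrase vector
    using learned attention weights (an illustrative sketch)."""

    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # learned per-token salience

    def forward(self, tokens, spans):
        # tokens: (seq_len, d_model); spans: list of (start, end) indices
        phrases = []
        for s, e in spans:
            seg = tokens[s:e]                          # (e - s, d_model)
            w = torch.softmax(self.score(seg), dim=0)  # attention inside the span
            phrases.append((w * seg).sum(dim=0))       # weighted average
        return torch.stack(phrases)                    # (num_phrases, d_model)

# usage: two phrases over a 9-token sentence
pool = AttentivePhrasePooling()
print(pool(torch.randn(9, 256), [(0, 4), (4, 9)]).shape)  # torch.Size([2, 256])
```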
- Progressive Transformers for End-to-End Sign Language Production
The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video.
Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences.
We propose Progressive Transformers, a novel architecture that can translate from discrete spoken language sentences to continuous 3D skeleton pose outputs representing sign language.
arXiv Detail & Related papers (2020-04-30T15:20:25Z)
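Producing a continuous stream of poses means the decoder regresses joint coordinates instead of sampling discrete tokens, which raises the question of when to stop. The sketch below uses a 0-to-1 progress counter to terminate generation, in the spirit of the paper's progressive decoding; the step function and dimensions here are illustrative.

```python
import torch

def progressive_decode(step_fn, d_pose=150, max_frames=100):
    """Counter-based continuous decoding sketch for sign language production:
    each step regresses a pose vector plus a progress counter in [0, 1],
    and generation stops once the counter reaches 1.
    `step_fn` stands in for one decoder step."""
    poses, prev = [], torch.zeros(d_pose)
    for t in range(max_frames):
        pose, counter = step_fn(prev, t)  # regress next pose + progress value
        poses.append(pose)
        prev = pose
        if counter >= 1.0:                # counter signals end of sequence
            break
    return torch.stack(poses)

# toy step function: random poses, counter grows linearly to 1 over 20 frames
out = progressive_decode(lambda prev, t: (torch.randn(150), (t + 1) / 20))
print(out.shape)  # torch.Size([20, 150])
```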
- Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T dataset.
Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models.
arXiv Detail & Related papers (2020-03-30T21:35:09Z)
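Jointly learning recognition and translation is naturally expressed as a weighted sum of a CTC loss over gloss labels on the encoder output and a cross-entropy loss over spoken-language tokens on the decoder output; the sketch below assumes that formulation, with illustrative dimensions and a 1:1 weighting.

```python
import torch
import torch.nn as nn

# Recognition head: CTC over gloss labels on the encoder output.
# Translation head: cross-entropy over spoken-language tokens on the decoder output.
ctc = nn.CTCLoss(blank=0)
xent = nn.CrossEntropyLoss()

T, B, n_gloss, n_vocab, L = 40, 2, 120, 8000, 15
enc_logprobs = torch.randn(T, B, n_gloss).log_softmax(-1)  # (time, batch, gloss)
gloss_tgt = torch.randint(1, n_gloss, (B, 10))             # gloss sequences, len 10
dec_logits = torch.randn(B * L, n_vocab)                   # flattened decoder steps
text_tgt = torch.randint(0, n_vocab, (B * L,))

recognition = ctc(enc_logprobs, gloss_tgt,
                  torch.full((B,), T, dtype=torch.long),   # input lengths
                  torch.full((B,), 10, dtype=torch.long))  # target lengths
translation = xent(dec_logits, text_tgt)
loss = recognition + translation  # jointly optimized end-to-end (1:1 weighting)
```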