Cascaded Models With Cyclic Feedback For Direct Speech Translation
- URL: http://arxiv.org/abs/2010.11153v2
- Date: Thu, 11 Feb 2021 16:52:33 GMT
- Title: Cascaded Models With Cyclic Feedback For Direct Speech Translation
- Authors: Tsz Kin Lam, Shigehiko Schamoni, Stefan Riezler
- Abstract summary: We present a technique that allows cascades of automatic speech recognition (ASR) and machine translation (MT) to exploit in-domain direct speech translation data.
A comparison to end-to-end speech translation using components of identical architecture and the same data shows gains of up to 3.8 BLEU points on LibriVoxDeEn and up to 5.1 BLEU points on CoVoST for German-to-English speech translation.
- Score: 14.839931533868176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct speech translation describes a scenario where only speech inputs and
corresponding translations are available. Such data are notoriously limited. We
present a technique that allows cascades of automatic speech recognition (ASR)
and machine translation (MT) to exploit in-domain direct speech translation
data in addition to out-of-domain MT and ASR data. After pre-training MT and
ASR, we use a feedback cycle where the downstream performance of the MT system
is used as a signal to improve the ASR system by self-training, and the MT
component is fine-tuned on multiple ASR outputs, making it more tolerant
towards spelling variations. A comparison to end-to-end speech translation
using components of identical architecture and the same data shows gains of up
to 3.8 BLEU points on LibriVoxDeEn and up to 5.1 BLEU points on CoVoST for
German-to-English speech translation.
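To make the feedback cycle concrete, the following is a minimal Python sketch of one plausible reading of the loop the abstract describes, not the authors' implementation: the callables `asr_nbest`, `mt_translate`, `asr_finetune`, and `mt_finetune` are hypothetical stand-ins for the pretrained components, and sentence-level BLEU against the reference translation serves as the downstream signal.
```python
# A minimal sketch of the cyclic feedback loop, assuming pretrained ASR/MT
# components behind the hypothetical callables below; one plausible reading
# of the abstract, not the authors' code.
from typing import Callable, List, Tuple

import sacrebleu  # any sentence-level BLEU implementation would do


def bleu(hypothesis: str, reference: str) -> float:
    """Sentence-level BLEU used as the downstream feedback signal."""
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score


def feedback_cycle(
    data: List[Tuple[object, str]],                 # in-domain (audio, reference translation) pairs
    asr_nbest: Callable[[object, int], List[str]],  # hypothetical: audio -> n-best transcripts
    mt_translate: Callable[[str], str],             # hypothetical: transcript -> translation
    asr_finetune: Callable[[List[Tuple[object, str]]], None],  # hypothetical self-training step
    mt_finetune: Callable[[List[Tuple[str, str]]], None],      # hypothetical fine-tuning step
    n: int = 5,
    rounds: int = 3,
) -> None:
    for _ in range(rounds):
        asr_labels: List[Tuple[object, str]] = []
        mt_pairs: List[Tuple[str, str]] = []
        for audio, ref in data:
            hyps = asr_nbest(audio, n)
            # Score each transcript by the BLEU of its downstream translation.
            scored = [(bleu(mt_translate(h), ref), h) for h in hyps]
            # The best-scoring transcript becomes a pseudo-label for ASR
            # self-training.
            asr_labels.append((audio, max(scored)[1]))
            # MT sees all n-best transcripts paired with the reference, which
            # makes it more tolerant to ASR spelling variations.
            mt_pairs.extend((h, ref) for h in hyps)
        asr_finetune(asr_labels)
        mt_finetune(mt_pairs)
```
Selecting transcripts by downstream BLEU is what couples the two systems: ASR is pushed toward transcripts that translate well, while MT adapts to the transcripts ASR actually produces.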
Related papers
- Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 [61.189875635090225]
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z)
- DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation [29.76274107159478]
Non-autoregressive Transformers (NATs) are applied in direct speech-to-speech translation systems.
We introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models.
Our strategies result in a notable improvement of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) on the CVSS benchmark.
arXiv Detail & Related papers (2024-05-22T01:10:39Z)
- Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims to translate source-language speech into target-language text without generating intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
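The shared-vocabulary idea can be illustrated with a toy vector-quantization step, a sketch under stated assumptions only: the encoder outputs below are random stand-ins and the codebook is fixed rather than learned; DCMA's actual encoders, codebook training, and matching losses are not reproduced here.
```python
# Toy illustration of a shared discrete vocabulary: speech and text encoder
# outputs are snapped to the nearest entry of one shared codebook, so both
# modalities land in the same discrete semantic space. All tensors here are
# random stand-ins, not DCMA's implementation.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 256))  # shared vocabulary: 512 codes of dim 256


def quantize(embeddings: np.ndarray) -> np.ndarray:
    """Snap each embedding to the index of its nearest codebook entry."""
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


speech_repr = rng.normal(size=(80, 256))  # stand-in speech encoder output
text_repr = rng.normal(size=(20, 256))    # stand-in text encoder output

# Both modalities now live in the same discrete index space, so they can be
# compared and matched directly, which is what enables zero-shot transfer.
print(quantize(speech_repr)[:5])
print(quantize(text_repr)[:5])
```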
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Large-Scale Streaming End-to-End Speech Translation with Neural Transducers [35.2855796745394]
We introduce a streaming end-to-end speech translation (ST) model to convert audio signals to texts in other languages directly.
Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency.
We extend TT-based ST to multilingual ST, which generates texts of multiple languages at the same time.
arXiv Detail & Related papers (2022-04-11T18:18:53Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- The IWSLT 2021 BUT Speech Translation Systems [2.4373900721120285]
The paper describes BUT's English-to-German offline speech translation (ST) systems developed for IWSLT 2021.
They are based on jointly trained Automatic Speech Recognition-Machine Translation models.
Their performance is evaluated on the MuST-C Common test set.
arXiv Detail & Related papers (2021-07-13T15:11:18Z)
- A Technical Report: BUT Speech Translation Systems [2.9327503320877457]
The paper describes BUT's English-to-German offline speech translation systems.
A large degradation is observed when translating ASR hypothesis compared to the oracle input text.
arXiv Detail & Related papers (2020-10-22T10:52:31Z)
- Jointly Trained Transformers models for Spoken Language Translation [2.3886615435250302]
This work trains SLT systems with an ASR objective as an auxiliary loss, with the two networks connected through neural hidden representations.
This architecture improved BLEU from 36.8 to 44.5.
All the experiments are reported on English-Portuguese speech translation task using How2 corpus.
arXiv Detail & Related papers (2020-04-25T11:28:39Z)