Jointly Trained Transformers models for Spoken Language Translation
- URL: http://arxiv.org/abs/2004.12111v1
- Date: Sat, 25 Apr 2020 11:28:39 GMT
- Title: Jointly Trained Transformers models for Spoken Language Translation
- Authors: Hari Krishna Vydana, Martin Karafiát, Kateřina Žmolíková, Lukáš
Burget, Honza Černocký
- Abstract summary: This work trains SLT systems with the ASR objective as an auxiliary loss, with the two networks connected through neural hidden representations.
This architecture improves the BLEU score from 36.8 to 44.5.
All the experiments are reported on English-Portuguese speech translation task using How2 corpus.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional spoken language translation (SLT) systems are pipeline-based
systems, where an Automatic Speech Recognition (ASR) system converts the
source modality from speech to text and a Machine Translation (MT) system
translates the source text into text in the target language. Recent progress in
sequence-to-sequence architectures has reduced the performance gap between
pipeline-based SLT systems (cascaded ASR-MT) and End-to-End approaches.
Though End-to-End and cascaded ASR-MT systems are reaching comparable
levels of performance, a large performance gap remains between MT models fed
ASR hypotheses and those fed oracle text. This gap indicates that
MT systems are prone to large performance degradation on noisy ASR
hypotheses as opposed to oracle text transcripts. In this work, this degradation
is reduced by creating an end-to-end differentiable pipeline
between the ASR and MT systems: we train SLT systems with the ASR
objective as an auxiliary loss, and the two networks are connected through
neural hidden representations. This training provides an End-to-End
differentiable path w.r.t. the final objective function while also utilizing
the ASR objective for better SLT performance. This architecture
improves the BLEU score from 36.8 to 44.5. Due to the multi-task training, the
model also generates ASR hypotheses, which are used by a pre-trained MT
model. Combining the proposed system with the MT model increases the BLEU
score by 1. All experiments are reported on the English-Portuguese speech
translation task using the How2 corpus. The final BLEU score is on par with the
best speech translation system on the How2 dataset, with no additional training
data or language model and far fewer parameters.
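The multi-task objective described in the abstract can be sketched as a weighted sum of the translation loss and the auxiliary ASR loss. This is a minimal illustration under assumed names and an assumed weight value, not the paper's actual implementation:

```python
# Hedged sketch of multi-task training with an auxiliary ASR loss:
# the total loss combines the primary SLT (translation) loss with a
# weighted ASR loss, so gradients flow end-to-end through both the
# ASR network and the MT network via shared hidden representations.
# The function name and the weight `asr_weight` are illustrative
# assumptions, not taken from the paper.

def joint_loss(slt_loss: float, asr_loss: float, asr_weight: float = 0.5) -> float:
    """Total loss = SLT loss + asr_weight * auxiliary ASR loss."""
    return slt_loss + asr_weight * asr_loss

# Example per-batch losses from the two decoders:
total = joint_loss(slt_loss=2.0, asr_loss=2.0)  # 2.0 + 0.5 * 2.0 = 3.0
```

Minimizing such a combined loss updates both networks through the connecting hidden representations, which is what makes the cascade end-to-end differentiable while still exploiting the ASR supervision.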
Related papers
- Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z) - On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition
We focus on the temporal structure of synthetic data and its relation to ASR training.
We show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive TTS.
Using a simple algorithm we shift phoneme duration distributions of the TTS system closer to real durations.
arXiv Detail & Related papers (2023-10-12T08:45:21Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - The USYD-JD Speech Translation System for IWSLT 2021
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z) - The IWSLT 2021 BUT Speech Translation Systems
This paper describes BUT's English-to-German offline speech translation (ST) systems developed for IWSLT 2021.
They are based on jointly trained Automatic Speech Recognition-Machine Translation models.
Their performance is evaluated on the MuST-C Common test set.
arXiv Detail & Related papers (2021-07-13T15:11:18Z) - A Technical Report: BUT Speech Translation Systems
The paper describes BUT's speech translation systems.
The systems are English-to-German offline speech translation systems.
A large degradation is observed when translating ASR hypothesis compared to the oracle input text.
arXiv Detail & Related papers (2020-10-22T10:52:31Z) - Cascaded Models With Cyclic Feedback For Direct Speech Translation
We present a technique that allows cascades of automatic speech recognition (ASR) and machine translation (MT) to exploit in-domain direct speech translation data.
A comparison to end-to-end speech translation using components of identical architecture and the same data shows gains of up to 3.8 BLEU points on LibriVoxDeEn and up to 5.1 BLEU points on CoVoST for German-to-English speech translation.
arXiv Detail & Related papers (2020-10-21T17:18:51Z) - Joint Contextual Modeling for ASR Correction and Language Understanding
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
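The multi-hypothesis fusion entry above describes attenuating ASR errors by combining several ASR hypotheses rather than trusting a single 1-best output. A toy sketch of that idea, softmax-weighting hypothesis embeddings by confidence scores, is shown below; the function names, the 2-d embeddings, and the scoring scheme are illustrative assumptions, not the cited paper's attention mechanism:

```python
# Toy sketch of multi-hypothesis fusion: each ASR hypothesis is
# represented by an embedding vector, and the fused representation is
# a softmax-weighted sum, so low-confidence (error-prone) hypotheses
# contribute less. All names and values here are illustrative.
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_hypotheses(embeddings, scores):
    """Weighted sum of per-hypothesis embedding vectors."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [sum(w * emb[i] for w, emb in zip(weights, embeddings))
            for i in range(dim)]

# Three toy 2-d hypothesis embeddings with assumed confidence scores:
fused = fuse_hypotheses([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                        scores=[2.0, 0.5, 1.0])
```

In the cited paper, the fusion weights come from a learned attention mechanism over the N-best hypotheses rather than raw confidence scores, but the averaging effect on ASR errors is the same in spirit.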
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.