Related papers: AdaTranS: Adapting with Boundary-based Shrinking for End-to-End Speech Translation

URL: http://arxiv.org/abs/2212.08911v1
Date: Sat, 17 Dec 2022 16:14:30 GMT
Title: AdaTranS: Adapting with Boundary-based Shrinking for End-to-End Speech Translation
Authors: Xingshan Zeng, Liangyou Li and Qun Liu
Abstract summary: AdaTranS adapts the speech features with a new shrinking mechanism to mitigate the length mismatch between speech and text features. Experiments on the MuST-C dataset demonstrate that AdaTranS achieves better performance than other shrinking-based methods.
Score: 36.12146100483228
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To alleviate the data scarcity problem in end-to-end speech translation (ST), pre-training on data for speech recognition and machine translation is considered an important technique. However, the modality gap between speech and text prevents the ST model from efficiently inheriting knowledge from the pre-trained models. In this work, we propose AdaTranS for end-to-end ST. It adapts the speech features with a new shrinking mechanism that predicts word boundaries, which mitigates the length mismatch between speech and text features. Experiments on the MuST-C dataset demonstrate that AdaTranS achieves better performance than other shrinking-based methods, with higher inference speed and lower memory usage. Further experiments show that AdaTranS can also be equipped with additional alignment losses to further improve performance.
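
The general idea of boundary-based shrinking can be illustrated with a minimal sketch. This is not the AdaTranS implementation: the abstract does not say how boundaries are predicted or how frames are merged, so the function name `shrink_by_boundaries`, the fixed `threshold`, and the use of mean pooling are all assumptions made for illustration. The sketch takes per-frame boundary probabilities as given and merges the frames between consecutive boundaries into one vector, so that the output length is closer to a word count than to a frame count.

```python
from typing import List, Sequence


def _mean(segment: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(segment[0])
    return [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]


def shrink_by_boundaries(
    frames: Sequence[Sequence[float]],
    boundary_probs: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[List[float]]:
    """Mean-pool consecutive frames into one vector per predicted segment.

    A frame whose boundary probability reaches `threshold` closes the
    current segment. Frames left over after the last boundary form a
    final segment, so no input frame is dropped.
    """
    if len(frames) != len(boundary_probs):
        raise ValueError("frames and boundary_probs must have equal length")

    shrunk: List[List[float]] = []
    segment: List[Sequence[float]] = []
    for frame, prob in zip(frames, boundary_probs):
        segment.append(frame)
        if prob >= threshold:
            shrunk.append(_mean(segment))
            segment = []
    if segment:
        shrunk.append(_mean(segment))
    return shrunk


# Hypothetical input: five 2-dimensional frames with boundaries after
# the second and fourth frames.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
boundary_probs = [0.1, 0.9, 0.2, 0.8, 0.3]
shrunk = shrink_by_boundaries(frames, boundary_probs)
print(len(frames), "->", len(shrunk))
```

In this toy input the five frames collapse into three segment vectors. A real system would learn the boundary predictor and may weight frames rather than average them uniformly.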

Related papers

Pushing the Limits of Zero-shot End-to-End Speech Translation [15.725310520335785]
Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems. We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data. Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
arXiv Detail & Related papers (2024-02-16T03:06:37Z)
Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones. We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
LongFNT: Long-form Speech Recognition with Factorized Neural Transducer [64.75547712366784]
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora, with 19% and 12% relative word error rate (WER) reduction, respectively.
arXiv Detail & Related papers (2022-11-17T08:48:27Z)
Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data. We present an unsupervised domain adaptation technique for pre-trained speech models. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV). We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on the high-quality TTS dataset. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques [12.968557512440759]
Several techniques have been proposed for zero-shot translation. We investigate whether these ideas can be applied to speech translation by building ST models trained on speech transcription and text translation data. The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points compared to direct end-to-end ST and +3.1 BLEU points compared to ST models fine-tuned from an ASR model.
arXiv Detail & Related papers (2022-01-26T20:20:59Z)
A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to better understand why and how model pre-training can positively contribute to TTS system performance. It is found that the TTS system can achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation. It maps speech features into text space with a weighted-shrinking operation and a semantic encoder. Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
arXiv Detail & Related papers (2021-06-09T06:35:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.