End-to-End Simultaneous Speech Translation with Differentiable
Segmentation
- URL: http://arxiv.org/abs/2305.16093v2
- Date: Sun, 18 Jun 2023 03:18:12 GMT
- Title: End-to-End Simultaneous Speech Translation with Differentiable
Segmentation
- Authors: Shaolei Zhang, Yang Feng
- Abstract summary: SimulST outputs translation while receiving the streaming speech inputs.
Segmenting the speech inputs at unfavorable moments can disrupt acoustic integrity and adversely affect the performance of the translation model.
We propose Differentiable Segmentation (DiSeg) for SimulST to learn segmentation directly from the underlying translation model.
- Score: 21.03142288187605
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: End-to-end simultaneous speech translation (SimulST) outputs translation
while receiving the streaming speech inputs (a.k.a. streaming speech
translation), and hence needs to segment the speech inputs and then translate
based on the current received speech. However, segmenting the speech inputs at
unfavorable moments can disrupt the acoustic integrity and adversely affect the
performance of the translation model. Therefore, learning to segment the speech
inputs at those moments that are beneficial for the translation model to
produce high-quality translation is the key to SimulST. Existing SimulST
methods, whether using fixed-length segmentation or an external segmentation
model, separate segmentation from the underlying translation model, and this
gap results in segmentation outcomes that are not necessarily beneficial to
the translation process. In this paper, we propose
Differentiable Segmentation (DiSeg) for SimulST to directly learn segmentation
from the underlying translation model. DiSeg makes hard segmentation
differentiable through the proposed expectation training, enabling it to be
jointly trained with the translation model and thereby learn
translation-beneficial segmentation. Experimental results demonstrate that
DiSeg achieves state-of-the-art performance and exhibits superior segmentation
capability.
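To make the expectation-training idea concrete, here is a minimal PyTorch sketch of a differentiable frame-level segmenter. It is an assumption-laden illustration, not the authors' DiSeg implementation: the module and helper names (FrameSegmenter, expected_segment_count), the toy count regularizer, and all dimensions are invented for the example.
```python
# Minimal sketch of expectation training: replace hard 0/1 segmentation
# decisions with their differentiable expectations during training.
# NOT the authors' implementation; names and dimensions are illustrative.
import torch
import torch.nn as nn

class FrameSegmenter(nn.Module):
    """Predicts a per-frame segment-boundary probability."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) -> boundary probabilities: (batch, time)
        return torch.sigmoid(self.scorer(frames)).squeeze(-1)

def expected_segment_count(p: torch.Tensor) -> torch.Tensor:
    # The expected number of segments is the sum of boundary probabilities;
    # unlike a hard count, this quantity is differentiable.
    return p.sum(dim=-1)

frames = torch.randn(2, 50, 256)             # dummy acoustic features
segmenter = FrameSegmenter(256)
p = segmenter(frames)                        # soft boundary probabilities

# Any loss defined on the expectations (here, a toy regularizer pulling the
# expected count toward 8 segments) backpropagates into the segmenter, so it
# can be trained jointly with a translation loss.
count_loss = (expected_segment_count(p) - 8.0).pow(2).mean()
count_loss.backward()
```
At inference time the probabilities would be thresholded into hard cuts; only training goes through the soft expectations.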
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model that predicts the boundaries of segmental structures in speech signals, and training a phoneme prediction model that takes the speech features segmented by the segmentation model as input and predicts a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
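As a rough, runnable illustration of the reinforcement-learned boundary stage, the toy REINFORCE step below samples hard boundaries from a frame-level policy and reinforces them with a dummy reward; in REBORN the reward would come from the phoneme prediction model, and every name here is a hypothetical stand-in.
```python
# Toy REINFORCE step for a frame-level boundary policy (illustrative only;
# the dummy reward stands in for the phoneme model's score in REBORN).
import torch
import torch.nn as nn

policy = nn.Linear(80, 1)                    # frame features -> boundary logit
feats = torch.randn(1, 100, 80)              # one dummy utterance
probs = torch.sigmoid(policy(feats)).squeeze(-1)

boundaries = torch.bernoulli(probs)          # sample hard 0/1 boundaries
log_prob = (boundaries * probs.clamp_min(1e-8).log()
            + (1 - boundaries) * (1 - probs).clamp_min(1e-8).log()).sum()
reward = -boundaries.sum() / 100             # dummy reward: prefer fewer cuts
loss = -reward * log_prob                    # REINFORCE: maximize E[reward]
loss.backward()                              # gradients reach the policy
```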
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Soft Alignment of Modality Space for End-to-end Speech Translation [49.29045524083467]
End-to-end Speech Translation aims to convert speech into target text within a unified model.
The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer.
We introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities.
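The adversarial alignment can be sketched with a discriminator that tries to tell pooled speech embeddings from pooled text embeddings while the encoders learn to fool it. The stand-in encoders, pooling, and discriminator below are assumptions for illustration, not the paper's architecture.
```python
# Minimal adversarial modality-alignment sketch (assumed architecture).
import torch
import torch.nn as nn

dim = 256
speech_enc = nn.GRU(80, dim, batch_first=True)   # stand-in speech encoder
text_enc = nn.Embedding(1000, dim)               # stand-in text encoder
disc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
bce = nn.BCEWithLogitsLoss()

speech = torch.randn(4, 100, 80)                 # dummy filterbank features
text = torch.randint(0, 1000, (4, 20))           # dummy token ids

h, _ = speech_enc(speech)
s = h.mean(dim=1)                                # pooled speech embedding
t = text_enc(text).mean(dim=1)                   # pooled text embedding

# Discriminator step: label speech as 1, text as 0.
d_loss = bce(disc(s.detach()), torch.ones(4, 1)) + \
         bce(disc(t.detach()), torch.zeros(4, 1))
d_loss.backward()                                # would update only the discriminator

# Encoder step: make speech embeddings indistinguishable from text.
g_loss = bce(disc(s), torch.zeros(4, 1))
g_loss.backward()                                # pushes the two spaces together
```
In practice the two losses would drive separate optimizer steps, or a gradient-reversal layer would replace the explicit min-max game.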
arXiv Detail & Related papers (2023-12-18T06:08:51Z)
- Long-Form End-to-End Speech Translation via Latent Alignment Segmentation [6.153530338207679]
Current simultaneous speech translation models can only process audio up to a few seconds long.
We propose a novel segmentation approach for a low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
arXiv Detail & Related papers (2023-09-20T15:10:12Z)
- Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation [0.17188280334580192]
Transformer models using segment-based processing have been an effective architecture for simultaneous speech translation.
We propose Shiftable Context to ensure consistent segment and context sizes are maintained throughout training and inference.
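The mismatch being addressed can be pictured with a toy streaming buffer that always presents the model with the same context and segment sizes at train and test time alike; this is a schematic of the general idea, not the paper's Shiftable Context mechanism, and the sizes are invented.
```python
# Toy streaming windower that keeps segment and context sizes fixed
# (schematic only; sizes are illustrative).
def stream_with_context(frames, seg_len=8, ctx_len=4):
    """Yield (context, segment) windows of fixed sizes over a frame stream."""
    buf = []
    for f in frames:
        buf.append(f)
        if len(buf) == ctx_len + seg_len:
            yield buf[:ctx_len], buf[ctx_len:]
            buf = buf[seg_len:]       # retain the trailing frames as context

for ctx, seg in stream_with_context(range(30)):
    print(ctx, seg)
```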
arXiv Detail & Related papers (2023-07-03T22:11:51Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation [10.799623536095226]
For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem.
We compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings.
Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
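As a concrete toy version of the fixed-window strategy, the sketch below slices a waveform into fixed-length, overlapping windows; the 20 s window and 2 s overlap are invented values, not the settings evaluated in the paper.
```python
# Toy fixed-window audio segmenter with overlap (illustrative values).
import numpy as np

def fixed_window_segments(audio, sr=16000, window_s=20.0, overlap_s=2.0):
    """Return (start, end) sample indices of fixed-length, overlapping windows."""
    win = int(window_s * sr)
    hop = win - int(overlap_s * sr)
    segments, start = [], 0
    while start < len(audio):
        segments.append((start, min(start + win, len(audio))))
        start += hop
    return segments

audio = np.zeros(16000 * 65)                 # 65 s of dummy audio
for s, e in fixed_window_segments(audio):
    print(f"{s / 16000:.1f}s -> {e / 16000:.1f}s")
```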
arXiv Detail & Related papers (2022-10-24T16:06:33Z)
- Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method based on a binary classification model trained on a segmented bilingual speech corpus.
Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
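The proposed segmenter reduces to frame-level binary classification; below is a minimal sketch in which boundary labels are assumed to come from a segmented bilingual corpus, with an invented feature dimension.
```python
# Frame-level boundary detection as binary classification (sketch; labels
# would come from a segmented bilingual corpus in practice).
import torch
import torch.nn as nn

clf = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 1))
loss_fn = nn.BCEWithLogitsLoss()

feats = torch.randn(32, 80)                      # dummy per-frame features
labels = torch.randint(0, 2, (32, 1)).float()    # 1 = boundary frame
loss = loss_fn(clf(feats), labels)
loss.backward()                                  # standard supervised update
```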
arXiv Detail & Related papers (2022-03-29T12:26:56Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming a strong base model and fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
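A compact sketch of adding a relative-position bias to attention scores follows; it uses the common clipped-distance, learned-embedding formulation as an assumed simplification, not necessarily the exact scheme in the paper.
```python
# Self-attention scores with a learned relative-position bias (simplified).
import torch
import torch.nn as nn

T, d, max_rel = 6, 16, 8
q, k = torch.randn(T, d), torch.randn(T, d)
rel_bias = nn.Embedding(2 * max_rel + 1, 1)      # one bias per clipped distance

pos = torch.arange(T)
dist = (pos[None, :] - pos[:, None]).clamp(-max_rel, max_rel) + max_rel
scores = q @ k.T / d ** 0.5 + rel_bias(dist).squeeze(-1)
attn = scores.softmax(dim=-1)                    # (T, T) attention weights
```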
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all content) and is not responsible for any consequences of its use.