Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation
- URL: http://arxiv.org/abs/2210.13363v1
- Date: Mon, 24 Oct 2022 16:06:33 GMT
- Title: Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation
- Authors: Chantal Amrhein and Barry Haddow
- Abstract summary: For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem.
We compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings.
Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
- Score: 10.799623536095226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For real-life applications, it is crucial that end-to-end spoken language
translation models perform well on continuous audio, without relying on
human-supplied segmentation. For online spoken language translation, where
models need to start translating before the full utterance is spoken, most
previous work has ignored the segmentation problem. In this paper, we compare
various methods for improving models' robustness towards segmentation errors
and different segmentation strategies in both offline and online settings and
report results on translation quality, flicker and delay. Our findings on five
different language pairs show that a simple fixed-window audio segmentation can
perform surprisingly well given the right conditions.
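The fixed-window strategy at the heart of the paper cuts the audio stream into constant-length, overlapping windows regardless of content. A minimal sketch of the idea (window and stride values here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def fixed_window_segments(audio, sample_rate, window_s=20.0, stride_s=15.0):
    """Split a 1-D audio array into fixed-length, overlapping windows.

    window_s: segment length in seconds (illustrative value).
    stride_s: hop between segment starts; stride < window gives overlap,
              so downstream merging can discard unreliable segment edges.
    """
    window = int(window_s * sample_rate)
    stride = int(stride_s * sample_rate)
    segments = []
    start = 0
    while start < len(audio):
        segments.append(audio[start:start + window])
        start += stride
    return segments

# Example: 60 s of audio at 16 kHz -> 4 windows (the last one shorter)
audio = np.zeros(60 * 16000, dtype=np.float32)
segs = fixed_window_segments(audio, 16000)
```

Because the windows overlap, each stretch of speech is translated more than once and the overlapping translations can be merged, which softens the impact of boundaries that fall mid-word.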
Related papers
- Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping [60.458273797431836]
Decoding by contrasting layers (DoLa) is designed to improve the generation quality of large language models.
We find that this approach does not work well on non-English tasks.
Inspired by previous interpretability work on language transition during the model's forward pass, we propose an improved contrastive decoding algorithm.
arXiv Detail & Related papers (2024-07-15T15:14:01Z)
- Lightweight Audio Segmentation for Long-form Speech Translation [17.743473111298826]
We propose a segmentation model that achieves better speech translation quality with a small model size.
We also show that proper integration of the speech segmentation model into the underlying ST system is critical to improve overall translation quality at inference time.
arXiv Detail & Related papers (2024-06-15T08:02:15Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during translation.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Long-Form End-to-End Speech Translation via Latent Alignment Segmentation [6.153530338207679]
Current simultaneous speech translation models can only process audio a few seconds long.
We propose a novel segmentation approach for low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
arXiv Detail & Related papers (2023-09-20T15:10:12Z)
- End-to-End Simultaneous Speech Translation with Differentiable Segmentation [21.03142288187605]
SimulST models output translation while still receiving streaming speech input.
Segmenting the speech input at unfavorable moments can disrupt acoustic integrity and adversely affect the performance of the translation model.
We propose Differentiable segmentation (DiSeg) for SimulST to directly learn segmentation from the underlying translation model.
arXiv Detail & Related papers (2023-05-25T14:25:12Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of supervised speech processing models.
Enhancing the robustness of speech processing models is essential for maintaining good performance when speech distortions are encountered.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation [14.151063458445826]
We show that our methods outperform all the other techniques, reducing the gap between the traditional VAD-based approach and optimal manual segmentation by at least 30%.
arXiv Detail & Related papers (2021-04-23T16:54:13Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
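Several entries above take voice activity detection (VAD) segmentation as the baseline that fixed-window, hybrid, or learned segmenters are compared against. A minimal energy-based VAD sketch, to make the contrast concrete (frame size and threshold are illustrative assumptions; real VADs adapt to the noise floor or use a trained classifier):

```python
import numpy as np

def energy_vad(audio, sample_rate, frame_s=0.02, threshold=1e-4):
    """Mark each frame as speech (True) or silence (False) by RMS energy."""
    frame = int(frame_s * sample_rate)
    flags = []
    for i in range(len(audio) // frame):
        chunk = audio[i * frame:(i + 1) * frame]
        rms = np.sqrt(np.mean(chunk ** 2))
        flags.append(rms > threshold)
    return flags

def vad_segments(flags):
    """Group consecutive speech frames into (start_frame, end_frame) spans."""
    spans, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans

# Example: 1 s of silence at 16 kHz with a 0.25 s burst of "speech"
sr = 16000
audio = np.zeros(sr, dtype=np.float32)
audio[4000:8000] = 0.1
spans = vad_segments(energy_vad(audio, sr))
```

Unlike the fixed-window strategy, segment boundaries here depend on pauses in the signal, so long pause-free stretches can produce segments far longer than a translation model handles well.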
This list is automatically generated from the titles and abstracts of the papers in this site.