Related papers: Contextualized Translation of Automatically Segmented Speech

Contextualized Translation of Automatically Segmented Speech

URL: http://arxiv.org/abs/2008.02270v1
Date: Wed, 5 Aug 2020 17:52:25 GMT
Title: Contextualized Translation of Automatically Segmented Speech
Authors: Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Mauro Cettolo, Marco Turchi
Abstract summary: We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context. Our solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
Score: 20.334746967390164
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence level, but at inference time they are commonly fed with audio split by a voice activity detector (VAD). Since VAD segmentation is not syntax-informed, the resulting segments do not necessarily correspond to well-formed sentences uttered by the speaker but, most likely, to fragments of one or more sentences. This segmentation mismatch degrades considerably the quality of ST models' output. So far, researchers have focused on improving audio segmentation towards producing sentence-like splits. In this paper, instead, we address the issue in the model, making it more robust to a different, potentially sub-optimal segmentation. To this aim, we train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context. We show that our context-aware solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.

Related papers

Lightweight Audio Segmentation for Long-form Speech Translation [17.743473111298826]
We propose a segmentation model that achieves better speech translation quality with a small model size. We also show that proper integration of the speech segmentation model into the underlying ST system is critical to improve overall translation quality at inference time.
arXiv Detail & Related papers (2024-06-15T08:02:15Z)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion. We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN,Reinforcement-Learned Boundary with Iterative Training for Unsupervised ASR. ReBORN alternates between training a segmentation model that predicts the boundaries of the segmental structures in speech signals and training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages. Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points. By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
End-to-End Simultaneous Speech Translation with Differentiable Segmentation [21.03142288187605]
SimulST outputs translation while receiving the streaming speech inputs. segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the translation model. We propose Differentiable segmentation (DiSeg) for SimulST to directly learn segmentation from the underlying translation model.
arXiv Detail & Related papers (2023-05-25T14:25:12Z)
Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning. We propose a new approach called context-aware fine-tuning. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation. On average, our models improve segmentation-F0.5 score by 9.8% over baseline. For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus. Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
arXiv Detail & Related papers (2022-03-29T12:26:56Z)
Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem. We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. We propose Supervised Hybrid Audio (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0. FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance. This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.