Contextualized Translation of Automatically Segmented Speech
- URL: http://arxiv.org/abs/2008.02270v1
- Date: Wed, 5 Aug 2020 17:52:25 GMT
- Title: Contextualized Translation of Automatically Segmented Speech
- Authors: Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Mauro Cettolo, Marco Turchi
- Abstract summary: We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
- Score: 20.334746967390164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct speech-to-text translation (ST) models are usually trained on corpora
segmented at sentence level, but at inference time they are commonly fed with
audio split by a voice activity detector (VAD). Since VAD segmentation is not
syntax-informed, the resulting segments do not necessarily correspond to
well-formed sentences uttered by the speaker but, most likely, to fragments of
one or more sentences. This segmentation mismatch considerably degrades the
quality of ST models' output. So far, researchers have focused on improving
audio segmentation towards producing sentence-like splits. In this paper,
instead, we address the issue in the model, making it more robust to a
different, potentially sub-optimal segmentation. To this aim, we train our
models on randomly segmented data and compare two approaches: fine-tuning and
adding the previous segment as context. We show that our context-aware solution
is more robust to VAD-segmented input, outperforming a strong base model and
the fine-tuning on different VAD segmentations of an English-German test set by
up to 4.25 BLEU points.
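The two ingredients named in the abstract, random segmentation of the training data and use of the previous segment as context, can be illustrated with a minimal Python sketch. The abstract does not say how the random splits are drawn or how the context is fed to the model, so everything below is an assumption made for illustration: the word-level timestamp format, the function names `random_segments` and `with_previous_context`, and the uniform choice of segment length are all hypothetical and not the authors' procedure.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical input format: (token, start_sec, end_sec) for each word.
Word = Tuple[str, float, float]


def random_segments(words: List[Word], min_len: int, max_len: int,
                    rng: random.Random) -> List[List[Word]]:
    """Cut a word sequence at random points, ignoring sentence boundaries.

    Segment lengths are drawn uniformly from [min_len, max_len] (an assumed
    policy); the last segment may be shorter than min_len.
    """
    segments = []
    i = 0
    while i < len(words):
        n = rng.randint(min_len, max_len)  # inclusive on both ends
        segments.append(words[i:i + n])
        i += n
    return segments


def with_previous_context(
    segments: List[List[Word]],
) -> List[Tuple[Optional[List[Word]], List[Word]]]:
    """Pair each segment with the one before it (None for the first)."""
    return [(segments[k - 1] if k > 0 else None, segments[k])
            for k in range(len(segments))]


# Toy example: ten words, each 0.5 s long.
words = [(f"w{j}", 0.5 * j, 0.5 * (j + 1)) for j in range(10)]
segs = random_segments(words, min_len=2, max_len=4, rng=random.Random(0))
pairs = with_previous_context(segs)
for context, current in pairs:
    ctx_span = None if context is None else (context[0][1], context[-1][2])
    print("context:", ctx_span, "segment:", (current[0][1], current[-1][2]))
```

Each printed pair gives the time span of the context segment and of the current segment; in a real pipeline these spans would select the audio (and the aligned target text) passed to the model.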
Related papers
- Lightweight Audio Segmentation for Long-form Speech Translation [17.743473111298826]
We propose a segmentation model that achieves better speech translation quality with a small model size.
We also show that proper integration of the speech segmentation model into the underlying ST system is critical to improve overall translation quality at inference time.
arXiv Detail & Related papers (2024-06-15T08:02:15Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model that predicts the boundaries of the segmental structures in speech signals and training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
- End-to-End Simultaneous Speech Translation with Differentiable Segmentation [21.03142288187605]
Simultaneous speech translation (SimulST) outputs translation while receiving the streaming speech inputs.
Segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the translation model.
We propose Differentiable segmentation (DiSeg) for SimulST to directly learn segmentation from the underlying translation model.
arXiv Detail & Related papers (2023-05-25T14:25:12Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve segmentation-F0.5 score by 9.8% over baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
- Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus.
Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
arXiv Detail & Related papers (2022-03-29T12:26:56Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments.
We propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.