SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
- URL: http://arxiv.org/abs/2202.04774v1
- Date: Wed, 9 Feb 2022 23:55:25 GMT
- Title: SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
- Authors: Ioannis Tsiamas, Gerard I. G\'allego, Jos\'e A. R. Fonollosa, Marta R.
Costa-juss\`a
- Abstract summary: Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments.
We propose Supervised Hybrid Audio (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech translation models are unable to directly process long audios, like
TED talks, which have to be split into shorter segments. Speech translation
datasets provide manual segmentations of the audios, which are not available in
real-world scenarios, and existing segmentation methods usually significantly
reduce translation quality at inference time. To bridge the gap between the
manual segmentation of training and the automatic one at inference, we propose
Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively
learn the optimal segmentation from any manually segmented speech corpus.
First, we train a classifier to identify the included frames in a segmentation,
using speech representations from a pre-trained wav2vec 2.0. The optimal
splitting points are then found by a probabilistic Divide-and-Conquer algorithm
that progressively splits at the frame of lowest probability until all segments
are below a pre-specified length. Experiments on MuST-C and mTEDx show that the
translation of the segments produced by our method approaches the quality of
the manual segmentation on 5 languages pairs. Namely, SHAS retains 95-98% of
the manual segmentation's BLEU score, compared to the 87-93% of the best
existing methods. Our method is additionally generalizable to different domains
and achieves high zero-shot performance in unseen languages.
Related papers
- Efficient Temporal Action Segmentation via Boundary-aware Query Voting [51.92693641176378]
BaFormer is a boundary-aware Transformer network that tokenizes each video segment as an instance token.
BaFormer significantly reduces the computational costs, utilizing only 6% of the running time.
arXiv Detail & Related papers (2024-05-25T00:44:13Z) - Segment and Caption Anything [126.20201216616137]
We propose a method to efficiently equip the Segment Anything Model with the ability to generate regional captions.
By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation.
We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice.
arXiv Detail & Related papers (2023-12-01T19:00:17Z) - Long-Form End-to-End Speech Translation via Latent Alignment
Segmentation [6.153530338207679]
Current simultaneous speech translation models can process audio only up to a few seconds long.
We propose a novel segmentation approach for a low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
arXiv Detail & Related papers (2023-09-20T15:10:12Z) - SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT)
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z) - Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic
Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z) - Smart Speech Segmentation using Acousto-Linguistic Features with
look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve segmentation-F0.5 score by 9.8% over baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z) - Speech Segmentation Optimization using Segmented Bilingual Speech Corpus
for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus.
Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
arXiv Detail & Related papers (2022-03-29T12:26:56Z) - Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct
Speech Translation [14.151063458445826]
We show that our methods outperform all the other techniques, reducing by at least 30% the gap between the traditional VAD-based approach and optimal manual segmentation.
arXiv Detail & Related papers (2021-04-23T16:54:13Z) - Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces the consistency between predictions of using inputs tokenized by the standard and probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z) - Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.