Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct
Speech Translation
- URL: http://arxiv.org/abs/2104.11710v1
- Date: Fri, 23 Apr 2021 16:54:13 GMT
- Title: Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct
Speech Translation
- Authors: Marco Gaido, Matteo Negri, Mauro Cettolo, Marco Turchi
- Abstract summary: We show that our methods outperform all the other techniques, reducing by at least 30% the gap between the traditional VAD-based approach and optimal manual segmentation.
- Score: 14.151063458445826
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The mismatch between the audio segmentation of the training data and
that encountered at run-time is a major problem in direct speech translation. Indeed, while systems
are usually trained on manually segmented corpora, in real use cases they are
often presented with continuous audio requiring automatic (and sub-optimal)
segmentation. After comparing existing techniques (VAD-based, fixed-length and
hybrid segmentation methods), in this paper we propose enhanced hybrid
solutions to produce better results without sacrificing latency. Through
experiments on different domains and language pairs, we show that our methods
outperform all the other techniques, reducing by at least 30% the gap between
the traditional VAD-based approach and optimal manual segmentation.
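To make the hybrid idea concrete, below is a minimal, self-contained sketch of pause-based splitting constrained by a hard length limit. A simple energy threshold stands in for a real VAD, and all numbers (frame size, threshold, length limits) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def hybrid_segment(audio, sr, frame_ms=25, energy_thresh=1e-4,
                   max_len_s=20.0, min_len_s=1.0):
    """Hybrid segmentation sketch: split at low-energy (non-speech) frames,
    but force a split whenever a segment would exceed max_len_s, so that
    missing pauses never produce segments longer than the model can handle.
    Assumes float audio normalized to [-1, 1]."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame
    # Energy-based stand-in for a trained VAD: one boolean per frame.
    energies = np.array([np.mean(audio[i * frame:(i + 1) * frame] ** 2)
                         for i in range(n_frames)])
    is_speech = energies > energy_thresh

    segments, start = [], 0
    for i in range(n_frames):
        seg_len = (i + 1 - start) * frame / sr
        pause = (not is_speech[i]) and seg_len >= min_len_s
        too_long = seg_len >= max_len_s
        if pause or too_long:
            segments.append((start * frame, (i + 1) * frame))
            start = i + 1
    if start < n_frames:
        segments.append((start * frame, n_frames * frame))
    return segments  # list of (sample_start, sample_end)
```

A real system would swap in a trained VAD and discard segments containing no speech, but the interplay between pause detection and a hard length limit is the essence of the hybrid approach.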
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Efficient Temporal Action Segmentation via Boundary-aware Query Voting [51.92693641176378]
BaFormer is a boundary-aware Transformer network that tokenizes each video segment as an instance token.
BaFormer significantly reduces computational cost, requiring only 6% of the running time.
arXiv Detail & Related papers (2024-05-25T00:44:13Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) network that devotes its main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
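SelfSeg itself scores segmentations with a self-supervised neural model; purely as an illustration of the sub-word segmentation task it addresses, a greedy longest-match tokenizer over a hypothetical sub-word vocabulary looks like this:

```python
def greedy_subword_split(word: str, vocab: set) -> list:
    """Greedy longest-match sub-word segmentation against a vocabulary.
    (Illustrates the task only; SelfSeg uses a neural scorer instead.)"""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # fall back to single char
                pieces.append(piece)
                i = j
                break
    return pieces

# e.g. greedy_subword_split("translation", {"trans", "la", "tion"})
# -> ['trans', 'la', 'tion']
```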
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve segmentation-F0.5 score by 9.8% over baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
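As a rough sketch of how acoustic and linguistic evidence can be fused at a candidate boundary (the linear interpolation, weights, and threshold here are illustrative assumptions, not the paper's model):

```python
def should_split(p_pause_acoustic: float, p_eos_lm: float,
                 alpha: float = 0.5, threshold: float = 0.6) -> bool:
    """Combine an acoustic pause probability with a language-model
    end-of-sentence probability via simple linear interpolation.
    alpha and threshold are hypothetical tuning knobs; with look-ahead,
    p_eos_lm would be computed after peeking at a few upcoming words."""
    score = alpha * p_pause_acoustic + (1.0 - alpha) * p_eos_lm
    return score >= threshold
```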
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
- Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation [10.799623536095226]
For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem.
We compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings.
Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
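A fixed-window splitter is trivial to implement, which is much of its appeal. The sketch below uses assumed parameters; setting the stride smaller than the window yields overlapping segments, whose translations a downstream step can merge to repair words cut at the boundaries:

```python
import numpy as np

def fixed_window_segments(audio: np.ndarray, sr: int,
                          window_s: float = 20.0, stride_s: float = 20.0):
    """Split audio into fixed-length windows. With stride_s < window_s the
    windows overlap, enabling later merging of overlapping hypotheses."""
    win, hop = int(window_s * sr), int(stride_s * sr)
    return [(start, min(start + win, len(audio)))
            for start in range(0, len(audio), hop)]
```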
arXiv Detail & Related papers (2022-10-24T16:06:33Z)
- Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus.
Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
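In this style of approach, inference reduces to thresholding frame-level split probabilities under length constraints. The sketch below assumes such probabilities already come from a trained binary (split / no-split) classifier; the threshold and length limits are illustrative:

```python
def segment_from_split_probs(split_probs, frame_s=0.02,
                             threshold=0.5, min_len_s=2.0, max_len_s=18.0):
    """Greedy segmentation from frame-level split probabilities."""
    boundaries, seg_start = [], 0.0
    for i, p in enumerate(split_probs):
        t = (i + 1) * frame_s
        seg_len = t - seg_start
        if (p >= threshold and seg_len >= min_len_s) or seg_len >= max_len_s:
            boundaries.append(t)  # place a split at time t
            seg_start = t
    return boundaries  # split times in seconds
```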
arXiv Detail & Related papers (2022-03-29T12:26:56Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audio inputs, such as TED talks, which have to be split into shorter segments.
We propose Supervised Hybrid Audio (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
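One inference strategy described for SHAS is a probabilistic divide-and-conquer over frame-level speech probabilities; the sketch below is a simplified approximation of that idea, with an assumed maximum segment length:

```python
import numpy as np

def divide_and_conquer_split(probs: np.ndarray, frame_s: float,
                             max_len_s: float = 18.0):
    """Recursively split a segment at the frame where the classifier is
    least confident that speech continues, until every piece fits the
    translation model's maximum length."""
    def split(lo: int, hi: int):
        if (hi - lo) * frame_s <= max_len_s:
            return [(lo * frame_s, hi * frame_s)]
        # Cut where the speech probability is lowest (likeliest pause).
        cut = lo + int(np.argmin(probs[lo:hi]))
        cut = min(max(cut, lo + 1), hi - 1)  # avoid degenerate edge cuts
        return split(lo, cut) + split(cut, hi)
    return split(0, len(probs))
```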
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
- Dealing with training and test segmentation mismatch: FBK@IWSLT2021 [13.89298686257514]
This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task.
It is a Transformer-based architecture trained to translate English speech audio data into German texts.
The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure.
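As a minimal illustration of the token-level knowledge distillation commonly used in such pipelines (the temperature and smoothing epsilon are illustrative; this is not necessarily FBK's exact loss):

```python
import numpy as np

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level knowledge distillation: mean KL divergence between the
    teacher's softened distribution and the student's, per target token."""
    def softmax(x):
        x = x / temperature
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    return float(np.mean(np.sum(
        p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9)), axis=-1)))
```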
arXiv Detail & Related papers (2021-06-23T18:11:32Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.