Lightweight Audio Segmentation for Long-form Speech Translation
- URL: http://arxiv.org/abs/2406.10549v1
- Date: Sat, 15 Jun 2024 08:02:15 GMT
- Title: Lightweight Audio Segmentation for Long-form Speech Translation
- Authors: Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung
- Abstract summary: We propose a segmentation model that achieves better speech translation quality with a small model size.
We also show that proper integration of the speech segmentation model into the underlying ST system is critical to improving overall translation quality at inference time.
- Score: 17.743473111298826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech segmentation is an essential part of speech translation (ST) systems in real-world scenarios. Since most ST models are designed to process short speech segments, long-form audio must be partitioned into shorter segments before translation. Recently, data-driven approaches to the speech segmentation task have been developed. Although these approaches improve overall translation quality, a performance gap remains due to a mismatch between the segmentation models and the ST systems. In addition, prior works require large self-supervised speech models, which consume significant computational resources. In this work, we propose a segmentation model that achieves better speech translation quality with a small model size. We propose an ASR-with-punctuation task as an effective pre-training strategy for the segmentation model. We also show that proper integration of the speech segmentation model into the underlying ST system is critical to improving overall translation quality at inference time.
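To make the pipeline described in the abstract concrete, here is a minimal sketch of long-form translation with a learned segmenter: a small model scores per-frame boundary probabilities, the audio is split on confident boundaries (with a hard cap so segments stay within the ST model's length limit), and each segment is translated independently. All names, rates, and thresholds are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a segment-then-translate pipeline for long-form ST.
# seg_model and st_model are assumed callables; values are illustrative.
import numpy as np

FRAME_RATE = 50          # assumed feature frames per second
MAX_SEGMENT_SEC = 20.0   # ST models typically expect short segments

def segment_audio(boundary_probs: np.ndarray, threshold: float = 0.5):
    """Turn per-frame boundary probabilities into (start, end) frame spans."""
    boundaries = [0] + [i for i, p in enumerate(boundary_probs) if p > threshold]
    boundaries.append(len(boundary_probs))
    max_frames = int(MAX_SEGMENT_SEC * FRAME_RATE)
    spans = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        # Fall back to hard cuts if a segment exceeds the ST model's limit.
        while end - start > max_frames:
            spans.append((start, start + max_frames))
            start += max_frames
        if end > start:
            spans.append((start, end))
    return spans

def translate_long_form(audio_features, seg_model, st_model):
    probs = seg_model(audio_features)   # small frame-level segmentation model
    spans = segment_audio(probs)
    return [st_model(audio_features[s:e]) for s, e in spans]
```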
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z) - Long-Form End-to-End Speech Translation via Latent Alignment
Segmentation [6.153530338207679]
Current simultaneous speech translation models can process only a few seconds of audio at a time.
We propose a novel segmentation approach for low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
arXiv Detail & Related papers (2023-09-20T15:10:12Z) - Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z) - Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z) - Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text
Translation [10.799623536095226]
For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem.
We compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings.
Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
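For reference, fixed-window segmentation amounts to slicing the waveform at regular intervals, optionally with overlap so that words cut mid-window appear whole in the next one. A minimal sketch, assuming 16 kHz audio; the window and overlap values are illustrative, not the paper's settings.

```python
# Minimal fixed-window audio segmentation with overlap (illustrative values).
import numpy as np

def fixed_windows(audio: np.ndarray, sr: int = 16000,
                  window_sec: float = 20.0, overlap_sec: float = 2.0):
    """Yield overlapping fixed-length windows over a long-form waveform."""
    window = int(window_sec * sr)
    stride = int((window_sec - overlap_sec) * sr)
    for start in range(0, len(audio), stride):
        yield audio[start:start + window]
        if start + window >= len(audio):
            break  # the last window already reached the end of the audio
```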
arXiv Detail & Related papers (2022-10-24T16:06:33Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
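The parameter-efficiency claim comes from training only a small set of prompt embeddings prepended to the input while the backbone stays frozen. A hedged PyTorch sketch of that idea; GSLM's real interface differs, and the class and dimensions below are assumptions.

```python
# Sketch of prompt tuning on a frozen speech LM: only the prompt matrix trains.
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, prompt_len: int = 10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                      # freeze the speech LM
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, unit_embeddings: torch.Tensor):    # (batch, time, dim)
        batch = unit_embeddings.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend trainable prompts to the (frozen) model's input sequence.
        return self.backbone(torch.cat([prompts, unit_embeddings], dim=1))
```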
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - Segmenting Subtitles for Correcting ASR Segmentation Errors [11.854481771567503]
We propose a model for correcting the acoustic segmentation of ASR models for low-resource languages.
We train a neural tagging model for correcting ASR acoustic segmentation and show that it improves downstream performance.
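A tagging formulation of segmentation correction can be read as: predict, for each ASR token, whether a segment should end after it, then re-split the token stream on those predictions. An illustrative sketch; the tag set and the tagger interface are assumptions, not this paper's model.

```python
# Re-split a flat ASR token stream using per-token boundary tags.
def resegment(tokens, tagger):
    """tokens: flat list of ASR tokens; tagger returns 'B' (segment ends after
    this token) or 'I' per token. Returns the corrected segments."""
    tags = tagger(tokens)
    segments, current = [], []
    for token, tag in zip(tokens, tags):
        current.append(token)
        if tag == "B":
            segments.append(current)
            current = []
    if current:
        segments.append(current)  # flush any trailing tokens
    return segments
```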
arXiv Detail & Related papers (2021-04-16T03:04:10Z) - Subtitles to Segmentation: Improving Low-Resource Speech-to-Text
Translation Pipelines [15.669334598926342]
We focus on improving ASR output segmentation in the context of low-resource language speech-to-text translation.
We use datasets of subtitles from TV shows and movies to train better ASR segmentation models.
We show that this noisy syntactic information can improve model accuracy.
arXiv Detail & Related papers (2020-10-19T17:32:40Z) - Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming both a strong base model and a fine-tuned baseline on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
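The "previous segment as context" approach can be pictured as translating each segment with its predecessor prepended to the encoder input. A minimal sketch, assuming segment features are NumPy arrays concatenable along time and an st_model callable; the real system also distinguishes the context portion from the current segment.

```python
# Translate each segment with the previous one prepended as context.
import numpy as np

def translate_with_context(segments, st_model):
    outputs, prev = [], None
    for seg in segments:
        inp = seg if prev is None else np.concatenate([prev, seg], axis=0)
        outputs.append(st_model(inp))
        prev = seg  # the current segment becomes the next one's context
    return outputs
```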
arXiv Detail & Related papers (2020-08-05T17:52:25Z)