Creating Speech-to-Speech Corpus from Dubbed Series
- URL: http://arxiv.org/abs/2203.03601v1
- Date: Mon, 7 Mar 2022 18:52:48 GMT
- Title: Creating Speech-to-Speech Corpus from Dubbed Series
- Authors: Massa Baali, Wassim El-Hajj, Ahmed Ali
- Abstract summary: We propose an unsupervised approach to construct a speech-to-speech corpus, aligned at the short-segment level.
Our methodology exploits video frames, speech recognition, machine translation, and noisy-frame removal algorithms to match segments in both languages.
Our pipeline was able to generate 17 hours of paired segments, which is about 47% of the corpus.
- Score: 8.21384946488751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dubbed series have gained considerable popularity in recent years, with strong
support from major media service providers. This popularity is fueled by
studies showing that dubbed versions of TV shows are more popular than
their subtitled equivalents. We propose an unsupervised approach to construct
a speech-to-speech corpus, aligned at the short-segment level, producing a parallel
speech corpus in the source and target languages. Our methodology exploits
video frames, speech recognition, machine translation, and noisy-frame removal
algorithms to match segments in both languages. To verify the performance of
the proposed method, we apply it to both long and short dubbed clips. Out of 36
hours of TR-AR dubbed series, our pipeline was able to generate 17 hours of paired
segments, about 47% of the corpus. We applied our method to another
language pair, EN-AR, to ensure that it is robust and not tuned to a
specific language or corpus. Regardless of the language pair, the
accuracy of the paired segments was around 70% in human
subjective evaluation. The corpus will be freely available to the research
community.
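The abstract describes the alignment pipeline only at a high level. The following is a minimal sketch, not the authors' released code, of how the segment-matching step could look once both audio tracks have been run through speech recognition and the source-language transcripts have been machine-translated into the target language: candidate segments are paired by text similarity, and low-scoring (noisy) pairs are dropped. The `Segment` class, `pair_segments`, the similarity threshold, and the toy texts are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) of the segment-matching idea:
# after ASR on both audio tracks and MT of the source transcripts into the
# target language, candidate segments are paired by text similarity, and
# low-scoring ("noisy") pairs are discarded. Segment times, the similarity
# threshold, and the example texts below are illustrative assumptions.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    text: str      # ASR transcript (source side: already machine-translated)

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for whatever
    text-matching score the actual pipeline uses."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_segments(source: list[Segment], target: list[Segment],
                  threshold: float = 0.5) -> list[tuple[Segment, Segment, float]]:
    """Greedy one-to-one matching of source and target segments.
    Pairs scoring below the threshold are treated as noisy and dropped."""
    pairs = []
    used = set()
    for s in source:
        best, best_score = None, threshold
        for j, t in enumerate(target):
            if j in used:
                continue
            score = similarity(s.text, t.text)
            if score > best_score:
                best, best_score = j, score
        if best is not None:
            used.add(best)
            pairs.append((s, target[best], best_score))
    return pairs

if __name__ == "__main__":
    # Toy example: Turkish audio transcribed and machine-translated into the
    # target language would appear on the source side; plain English stands
    # in for both sides here.
    src = [Segment(0.0, 2.1, "where are you going"), Segment(2.4, 4.0, "come back soon")]
    tgt = [Segment(0.0, 2.0, "where are you going"), Segment(2.3, 4.2, "return quickly")]
    for s, t, score in pair_segments(src, tgt):
        print(f"{s.start:.1f}-{s.end:.1f}s <-> {t.start:.1f}-{t.end:.1f}s  sim={score:.2f}")
```

A greedy one-to-one match with a similarity cutoff is only one plausible realization; the paper's actual pipeline additionally uses video frames to constrain candidate matches and to remove noisy frames.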
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into the speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation [16.630616128169372]
We propose a speech segmentation method using a binary classification model trained on a segmented bilingual speech corpus (a toy sketch of this frame-classification idea appears after this list).
Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods.
arXiv Detail & Related papers (2022-03-29T12:26:56Z)
- Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks [9.160401226886947]
The Donate Speech campaign has so far succeeded in gathering approximately 3600 hours of ordinary, colloquial Finnish speech.
The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services.
We present the collection process and the collected corpus, and showcase its versatility through multiple use cases.
arXiv Detail & Related papers (2022-03-24T07:50:25Z)
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation [0.0]
Speech translation models are unable to directly process long audio inputs, such as TED talks, which have to be split into shorter segments.
We propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus.
Experiments on MuST-C and mTEDx show that SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods.
arXiv Detail & Related papers (2022-02-09T23:55:25Z)
- GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio [88.20960848885575]
GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training.
Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles.
For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h.
arXiv Detail & Related papers (2021-06-13T04:09:16Z)
- Construction of a Large-scale Japanese ASR Corpus on TV Recordings [2.28438857884398]
This paper presents a new large-scale Japanese speech corpus for training automatic speech recognition (ASR) systems.
This corpus contains over 2,000 hours of speech with transcripts built on Japanese TV recordings and their subtitles.
arXiv Detail & Related papers (2021-03-26T21:14:12Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
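Two of the related papers above (Speech Segmentation Optimization and SHAS) segment long audio by training a classifier that scores how likely each frame is to lie inside a speech segment. As noted in the Speech Segmentation Optimization entry, here is a toy sketch of how such per-frame scores could be turned into segments; the frame rate, `max_frames`, and the example probabilities are assumptions, and neither paper's actual splitting algorithm is reproduced here.

```python
# Illustrative sketch only (neither paper's released code): given per-frame
# "inside a segment" probabilities from a trained classifier, recursively cut
# at the lowest-probability frame until every segment fits a maximum length.
# Frame rate, max_frames, and the example probabilities are assumptions.
from typing import List, Tuple

def split_segments(frame_probs: List[float], max_frames: int) -> List[Tuple[int, int]]:
    """Recursively split [start, end) at the least 'speech-like' frame until
    every segment is at most max_frames long. Returns (start, end) frame ranges."""
    def split(start: int, end: int) -> List[Tuple[int, int]]:
        if end - start <= max_frames:
            return [(start, end)]
        # cut at the frame with the lowest probability, away from the edges
        window = frame_probs[start + 1:end - 1]
        cut = start + 1 + min(range(len(window)), key=window.__getitem__)
        return split(start, cut) + split(cut, end)
    return split(0, len(frame_probs))

if __name__ == "__main__":
    # Toy per-frame probabilities (e.g. 1 frame = 20 ms); dips mark likely pauses.
    probs = [0.9, 0.95, 0.9, 0.2, 0.85, 0.9, 0.92, 0.1, 0.88, 0.9, 0.91, 0.87]
    print(split_segments(probs, max_frames=5))  # [(0, 3), (3, 7), (7, 12)]
```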