Construction of a Large-scale Japanese ASR Corpus on TV Recordings
- URL: http://arxiv.org/abs/2103.14736v1
- Date: Fri, 26 Mar 2021 21:14:12 GMT
- Title: Construction of a Large-scale Japanese ASR Corpus on TV Recordings
- Authors: Shintaro Ando, Hiromasa Fujihara
- Abstract summary: This paper presents a new large-scale Japanese speech corpus for training automatic speech recognition (ASR) systems.
This corpus contains over 2,000 hours of speech with transcripts built on Japanese TV recordings and their subtitles.
- Score: 2.28438857884398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a new large-scale Japanese speech corpus for training
automatic speech recognition (ASR) systems. This corpus contains over 2,000
hours of speech with transcripts built on Japanese TV recordings and their
subtitles. We herein develop an iterative workflow to extract matching audio
and subtitle segments from TV recordings, based on a conventional method for
lightly-supervised audio-to-text alignment. We evaluate a model trained with
our corpus using an evaluation dataset built on Japanese TEDx presentation
videos and confirm that its performance is better than that of a model trained
with the Corpus of Spontaneous Japanese (CSJ). The experimental results show
the usefulness of our corpus for training ASR systems. The corpus is made
public for the research community, along with Kaldi scripts for training the
models reported in this paper.
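As a concrete illustration of the matching step in this workflow, the sketch
below shows one plausible filtering pass: a seed ASR model decodes each
subtitle-aligned audio chunk, and a pair is kept only if the hypothesis is
close enough to the subtitle text. This is a minimal sketch under assumptions
not stated in the abstract; the function names, the editdistance dependency,
and the CER threshold are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of one lightly-supervised matching pass: decode each
# subtitle-aligned audio chunk with a seed ASR model, compare the hypothesis
# with the subtitle text by character error rate (CER), and keep only pairs
# that match closely enough. Names and the threshold are assumptions.
import editdistance  # third-party package: pip install editdistance


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis against a subtitle string."""
    if not reference:
        return 1.0
    return editdistance.eval(reference, hypothesis) / len(reference)


def filter_segments(segments, asr_decode, max_cer=0.2):
    """Keep (audio, subtitle) pairs whose ASR hypothesis matches the subtitle.

    segments   -- iterable of (audio_chunk, subtitle_text) pairs
    asr_decode -- callable mapping an audio chunk to a hypothesis string
    max_cer    -- matching threshold (an assumed value, not from the paper)
    """
    kept = []
    for audio, subtitle in segments:
        if cer(subtitle, asr_decode(audio)) <= max_cer:
            kept.append((audio, subtitle))
    return kept
```

In the iterative workflow described above, the pairs kept by such a filter
would be used to retrain the ASR model, which is then applied again to
recover additional matching segments from the recordings.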
Related papers
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine
Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z) - DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z) - ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker, sampled at 40,100 Hz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z) - BASPRO: a balanced script producer for speech corpus collection based on
the genetic algorithm [29.701197643765674]
The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation.
We propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences.
arXiv Detail & Related papers (2022-12-11T02:05:30Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a
Speech Recognition Baseline [0.0]
This paper introduces TALCS, a new corpus for Mandarin-English code-switching speech recognition.
The TALCS corpus is derived from real online one-on-one English teaching sessions at TAL Education Group.
To the best of our knowledge, TALCS is the largest well-labeled, open-source Mandarin-English code-switching automatic speech recognition dataset in the world.
arXiv Detail & Related papers (2022-06-27T09:30:25Z) - Creating Speech-to-Speech Corpus from Dubbed Series [8.21384946488751]
We propose an unsupervised approach to construct a speech-to-speech corpus aligned at the short-segment level.
Our methodology exploits video frames, speech recognition, machine translation, and noisy-frame removal algorithms to match segments in both languages.
Our pipeline was able to generate 17 hours of paired segments, which is about 47% of the corpus.
arXiv Detail & Related papers (2022-03-07T18:52:48Z) - Cascaded Multilingual Audio-Visual Learning from Videos [49.44796976615445]
We propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages.
With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training solely on the Japanese videos.
We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.
arXiv Detail & Related papers (2021-11-08T20:53:50Z) - BSTC: A Large-Scale Chinese-English Speech Translation Dataset [26.633433687767553]
BSTC (Baidu Speech Translation Corpus) is a large-scale Chinese-English speech translation dataset.
This dataset is constructed based on a collection of licensed videos of talks or lectures, including about 68 hours of Mandarin data.
We asked three experienced interpreters to simultaneously interpret the test talks in a mock conference setting.
arXiv Detail & Related papers (2021-04-08T07:38:51Z) - "Listen, Understand and Translate": Triple Supervision Decouples
End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of available parallel corpora.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z) - AVLnet: Learning Audio-Visual Language Representations from
Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)