Weakly Supervised Construction of ASR Systems with Massive Video Data
- URL: http://arxiv.org/abs/2008.01300v2
- Date: Sat, 19 Sep 2020 07:22:35 GMT
- Title: Weakly Supervised Construction of ASR Systems with Massive Video Data
- Authors: Mengli Cheng, Chengyu Wang, Xu Hu, Jun Huang, Xiaobo Wang
- Abstract summary: We present a weakly supervised framework for constructing ASR systems with massive video data.
We propose an effective approach, based on Optical Character Recognition (OCR), to extract high-quality audio aligned with transcripts from videos.
Our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
- Score: 18.5050375783871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building Automatic Speech Recognition (ASR) systems from scratch is significantly challenging, mostly due to the time-consuming and financially expensive process of annotating large amounts of audio data with transcripts. Although several unsupervised pre-training models have been proposed, applying such models directly might still be sub-optimal if more labeled training data could be obtained at low cost. In this paper, we present a weakly supervised framework for constructing ASR systems with massive video data. As videos often contain human speech aligned with subtitles, we consider videos an important knowledge source and propose an effective approach, based on Optical Character Recognition (OCR), to extract high-quality audio aligned with transcripts from videos. After weakly supervised pre-training, the underlying ASR model can be fine-tuned to fit any domain-specific target training dataset. Extensive experiments show that our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
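To make the extraction step concrete, here is a minimal Python sketch of the idea, not the authors' implementation: it assumes hard-coded subtitles in a fixed band at the bottom of each frame, OCRs sampled frames with Tesseract (the `chi_sim` language pack, the sampling rate, and the crop region are illustrative assumptions), merges consecutive frames with identical text into timed segments, and cuts the matching audio spans with ffmpeg.

```python
# A minimal sketch of OCR-based subtitle extraction, assuming hard-coded
# subtitles in a fixed band at the bottom of each frame. NOT the authors'
# implementation; sampling rate, crop region, and OCR language are
# illustrative choices.
import subprocess

import cv2           # pip install opencv-python
import pytesseract   # pip install pytesseract (needs the Tesseract binary)

def extract_subtitle_segments(video_path, fps_sample=2, band_ratio=0.2):
    """OCR the subtitle band of sampled frames and merge consecutive
    frames with identical text into (start_sec, end_sec, transcript)."""
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(video_fps / fps_sample)))

    segments, cur_text, cur_start, t = [], None, 0.0, 0.0
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            t = frame_idx / video_fps
            h = frame.shape[0]
            band = frame[int(h * (1 - band_ratio)):, :]   # bottom strip
            gray = cv2.cvtColor(band, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray, lang="chi_sim").strip()
            if text != cur_text:                          # subtitle changed
                if cur_text:
                    segments.append((cur_start, t, cur_text))
                cur_text, cur_start = text, t
        frame_idx += 1
    cap.release()
    if cur_text:                                          # flush final subtitle
        segments.append((cur_start, t, cur_text))
    return segments

def cut_audio(video_path, start, end, out_wav):
    """Cut the aligned audio span with ffmpeg (16 kHz mono, typical for ASR)."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ss", str(start),
                    "-to", str(end), "-ar", "16000", "-ac", "1", out_wav],
                   check=True)
```

Each resulting segment yields one weakly labeled audio-transcript pair; a real pipeline would additionally filter segments by OCR confidence and audio quality before pre-training.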
Related papers
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training methods, such as quantization schemes and time-domain vs. spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework that aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that the added components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training, which we show is crucial for enabling the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and that self-supervised models trained on our automatically constructed data achieve downstream performance similar to that of models trained on existing video datasets of comparable scale (a toy ranking sketch appears after this list).
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
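As a toy illustration of the curation idea in the "Automatic Curation of Large-Scale Datasets" entry above, the sketch below ranks clips by a cosine-similarity correspondence score between clip-level audio and video embeddings and keeps the top k. The embeddings and the simple top-k rule are assumptions; the paper's subset optimization is more involved.

```python
# A toy stand-in for correspondence-based curation: rank clips by the
# cosine similarity of assumed clip-level audio/video embeddings and keep
# the top k. The paper's subset optimization objective is more involved.
import numpy as np

def av_correspondence(audio_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Cosine similarity between one clip's audio and video embeddings."""
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    v = video_emb / (np.linalg.norm(video_emb) + 1e-8)
    return float(a @ v)

def curate_top_k(clip_ids, audio_embs, video_embs, k):
    """Keep the k clips whose audio best matches their visual content."""
    scores = [av_correspondence(a, v) for a, v in zip(audio_embs, video_embs)]
    order = np.argsort(scores)[::-1][:k]
    return [clip_ids[i] for i in order]
```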
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.