OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
- URL: http://arxiv.org/abs/2301.06375v1
- Date: Mon, 16 Jan 2023 11:40:50 GMT
- Title: OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
- Authors: Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi, Seung-Hyun Lee, Jun Hwan Ahn, Rae-Hong Park, Hyung-Min Park
- Abstract summary: The Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset is the largest among publicly available audio-visual speech datasets.
The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations.
- Score: 14.619865864254924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by how humans comprehend speech in a multi-modal manner, various
audio-visual datasets have been constructed. However, most existing datasets
focus on English, depend on various prediction models during dataset
preparation, and contain only a small number of multi-view videos. To
mitigate these limitations, we recently developed the Open Large-scale Korean
Audio-Visual Speech (OLKAVS) dataset, which is the largest among publicly
available audio-visual speech datasets. The dataset contains 1,150 hours of
transcribed audio from 1,107 Korean speakers in a studio setup with nine
different viewpoints and various noise situations. We also provide the
pre-trained baseline models for two tasks, audio-visual speech recognition and
lip reading. We conducted experiments based on the models to verify the
effectiveness of multi-modal and multi-view training over uni-modal and
frontal-view-only training. We expect the OLKAVS dataset to facilitate
multi-modal research in broader areas such as Korean speech recognition,
speaker recognition, pronunciation level classification, and mouth motion
analysis.
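
As a concrete illustration of the multi-modal training that the baselines are compared against, the sketch below shows a minimal, hypothetical late-fusion model that consumes audio features and lip-region frames jointly. It is a PyTorch-style toy example under assumed tensor shapes (80-dim log-mel frames, 96x96 lip crops, a 2,000-unit vocabulary); it is not the released OLKAVS baseline architecture or data loader.

```python
import torch
import torch.nn as nn


class AudioVisualEncoder(nn.Module):
    """Toy late-fusion audio-visual encoder (illustrative only)."""

    def __init__(self, n_mels=80, lip_dim=96 * 96, hidden=256, vocab=2000):
        super().__init__()
        # Audio branch: log-mel frames -> bidirectional recurrent encoder.
        self.audio_rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # Visual branch: flattened grayscale lip crops -> bidirectional recurrent encoder.
        self.video_proj = nn.Linear(lip_dim, hidden)
        self.video_rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Late fusion of both branches, then a CTC-style output layer.
        self.classifier = nn.Linear(4 * hidden, vocab + 1)  # +1 for the CTC blank

    def forward(self, mels, lips):
        # mels: (batch, audio_frames, n_mels); lips: (batch, video_frames, lip_dim)
        a, _ = self.audio_rnn(mels)                   # (batch, audio_frames, 2*hidden)
        v, _ = self.video_rnn(self.video_proj(lips))  # (batch, video_frames, 2*hidden)
        # Align the slower video stream to the audio frame rate by interpolation.
        v = nn.functional.interpolate(
            v.transpose(1, 2), size=a.size(1), mode="linear", align_corners=False
        ).transpose(1, 2)
        return self.classifier(torch.cat([a, v], dim=-1))  # (batch, audio_frames, vocab+1)


if __name__ == "__main__":
    model = AudioVisualEncoder()
    mels = torch.randn(2, 300, 80)       # ~3 s of audio features at 100 frames/s
    lips = torch.randn(2, 75, 96 * 96)   # ~3 s of lip crops at 25 fps
    print(model(mels, lips).shape)       # torch.Size([2, 300, 2001])
```

In this sketch, a uni-modal baseline in the paper's sense would correspond to dropping one of the two branches before the classifier.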
Related papers
- YODAS: Youtube-Oriented Dataset for Audio and Speech [47.60574092241447]
YODAS is a large-scale, multilingual dataset comprising over 500k hours of speech data in more than 100 languages.
The labeled subsets, including manual or automatic subtitles, facilitate supervised model training.
YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license.
arXiv Detail & Related papers (2024-06-02T23:43:27Z)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy that processes discretized visual speech units.
We set new state-of-the-art multilingual VSR performance by achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding audio processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establishing an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models trained with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)