CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command
Recognition
- URL: http://arxiv.org/abs/2201.03804v1
- Date: Tue, 11 Jan 2022 06:32:12 GMT
- Title: CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command
Recognition
- Authors: Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu, Elham J. Barezi, Peng
Xu, Cheuk Tung Shadow Yiu, Rita Frieske, Holy Lovenia, Genta Indra Winata,
Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung
- Abstract summary: We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR)
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
- Score: 91.33781557979819
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of deep learning and intelligent vehicles, the smart assistant
has become an essential in-car component that facilitates driving and provides
extra functionality. In-car smart assistants should be able to process
general as well as car-related commands and perform the corresponding actions,
which eases driving and improves safety. However, data scarcity in low-resource
languages hinders the development of both research and
applications. In this paper, we introduce a new dataset, Cantonese In-car
Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in
the Cantonese language with both video and audio data. It consists of 4,984
samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese
speakers. Furthermore, we augment our dataset using common in-car background
noises to simulate real environments, producing a dataset 10 times larger than
the collected one. We provide detailed statistics of both the clean and the
augmented versions of our dataset. Moreover, we implement two multimodal
baselines to demonstrate the validity of CI-AVSR. Experiment results show that
leveraging the visual signal improves the overall performance of the model.
Although our best model achieves considerable recognition quality on the clean test
set, recognition on the noisy data is still substantially worse and
remains an extremely challenging task for real in-car speech recognition
systems. The dataset and code will be released at
https://github.com/HLTCHKUST/CI-AVSR.
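The augmentation described in the abstract (mixing common in-car background noises into the clean recordings to obtain a dataset ten times larger) is typically implemented as SNR-controlled mixing. The sketch below illustrates one such recipe; the file names, the SNR grid, and the use of numpy/soundfile are assumptions for illustration, not details taken from the paper or its released code.

```python
# Hedged sketch of SNR-controlled noise mixing; assumes mono 1-D waveforms.
import os
import numpy as np
import soundfile as sf

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech clip at a target signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    # Normalize if the mixture would clip when written back to PCM.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Example: several noisy copies of one clean command (hypothetical file names).
os.makedirs("augmented", exist_ok=True)
speech, sr = sf.read("clean/command_0001.wav")
noise, _ = sf.read("noise/engine_idle.wav")
for snr in [0, 5, 10, 15, 20]:  # assumed SNR grid, not specified in the abstract
    sf.write(f"augmented/command_0001_snr{snr}.wav", mix_at_snr(speech, noise, snr), sr)
```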
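The abstract also mentions two multimodal baselines that benefit from the visual signal, but does not describe their architectures. Purely as an illustration of what audio-visual fusion can look like, here is a minimal late-fusion sketch in PyTorch; the GRU encoders, layer sizes, and the assumption of frame-aligned streams are hypothetical and not the authors' baselines.

```python
# Hedged sketch of a generic late-fusion audio-visual recognizer (not the paper's model).
import torch
import torch.nn as nn

class LateFusionAVSR(nn.Module):
    def __init__(self, n_mels=80, lip_feat_dim=512, hidden=256, vocab_size=4000):
        super().__init__()
        # Audio branch: bidirectional GRU over log-Mel frames.
        self.audio_enc = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # Visual branch: bidirectional GRU over per-frame lip-region embeddings.
        self.visual_enc = nn.GRU(lip_feat_dim, hidden, batch_first=True, bidirectional=True)
        # Joint classifier over the concatenated modalities, one prediction per time step.
        self.classifier = nn.Linear(4 * hidden, vocab_size)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T, n_mels); visual_feats: (batch, T, lip_feat_dim)
        # Assumes both streams are already aligned to the same frame rate.
        a, _ = self.audio_enc(audio_feats)
        v, _ = self.visual_enc(visual_feats)
        fused = torch.cat([a, v], dim=-1)
        return self.classifier(fused)  # logits for a CTC or cross-entropy objective

# Example with random tensors standing in for 100 aligned audio/video frames.
model = LateFusionAVSR()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
print(logits.shape)  # torch.Size([2, 100, 4000])
```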
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8% (the metric is sketched in the example after this list).
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands [48.155806720847394]
In-car smart assistants should be able to process general as well as car-related commands.
Most datasets are in major languages, such as English and Chinese.
We propose Cantonese Audio-Visual Speech Recognition for In-car Commands.
arXiv Detail & Related papers (2022-07-06T13:31:56Z)
- Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech [0.9653976364051563]
We present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech.
We present a wide range of experiments with various fine-tuning setups evaluated on two public datasets.
arXiv Detail & Related papers (2022-06-15T16:14:37Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
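Several of the papers listed above report performance as word error rate (WER); for Cantonese, the character error rate is computed the same way over characters instead of words. Below is a minimal sketch of the standard edit-distance definition of WER, not code from any of the listed papers.

```python
# Hedged sketch of word error rate (WER) via word-level Levenshtein distance.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example with hypothetical transcripts: one substitution over five words -> 20% WER.
print(wer("turn on the air conditioner", "turn off the air conditioner"))  # 0.2
```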