MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with
Depth Information
- URL: http://arxiv.org/abs/2306.02263v1
- Date: Sun, 4 Jun 2023 05:00:12 GMT
- Title: MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with
Depth Information
- Authors: Jianrong Wang, Yuchen Huo, Li Liu, Tianyi Xu, Qi Li, Sen Li
- Abstract summary: This work establishes the MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers.
To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed.
In particular, the latest data acquisition device of Microsoft, Azure Kinect is used to capture depth information in addition to the traditional audio signals and RGB images during data acquisition.
- Score: 21.864200803678003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual speech recognition (AVSR) gains increasing attention from
researchers as an important part of human-computer interaction. However, the
existing available Mandarin audio-visual datasets are limited and lack the
depth information. To address this issue, this work establishes the MAVD, a new
large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by
64 native Chinese speakers. To ensure the dataset covers diverse real-world
scenarios, a pipeline for cleaning and filtering the raw text material has been
developed to create a well-balanced reading material. In particular, the latest
data acquisition device of Microsoft, Azure Kinect is used to capture depth
information in addition to the traditional audio signals and RGB images during
data acquisition. We also provide a baseline experiment, which could be used to
evaluate the effectiveness of the dataset. The dataset and code will be
released at https://github.com/SpringHuo/MAVD.
Related papers
- Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support [5.926447149127937]
We develop version 3 of the Divide and Remaster (DnR) dataset.
This work addresses issues relating to vocal content in non-dialogue stems, loudness distributions, mastering process, and linguistic diversity.
Benchmark results using the Bandit model indicated that training on multilingual data yields significant generalizability to the model.
arXiv Detail & Related papers (2024-07-09T23:39:37Z) - The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data [28.23517306589778]
The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains.
There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas.
High-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset.
arXiv Detail & Related papers (2024-03-21T00:13:59Z) - AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset [21.90332221144928]
We propose the AV-Deepfake1M dataset for the detection and localization of deepfake audio-visual content.
The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos.
arXiv Detail & Related papers (2023-11-26T14:17:51Z) - Deepfake audio as a data augmentation technique for training automatic
speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z) - Separate What You Describe: Language-Queried Audio Source Separation [53.65665794338574]
We introduce the task of language-queried audio source separation (LASS)
LASS aims to separate a target source from an audio mixture based on a natural language query of the target source.
We propose LASS-Net, an end-to-end neural network that is learned to jointly process acoustic and linguistic information.
arXiv Detail & Related papers (2022-03-28T23:47:57Z) - Automatic Speech Recognition Datasets in Cantonese Language: A Survey
and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z) - HUI-Audio-Corpus-German: A high quality TTS dataset [0.0]
"HUI-Audio-Corpus-German" is a large, open-source dataset for TTS engines, created with a processing pipeline.
This dataset produces high quality audio to transcription alignments and decreases manual effort needed for creation.
arXiv Detail & Related papers (2021-06-11T10:59:09Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.