Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset
- URL: http://arxiv.org/abs/2203.16844v1
- Date: Thu, 31 Mar 2022 07:01:06 GMT
- Title: Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset
- Authors: Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng
Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie, Yonghong
Yan
- Abstract summary: This paper introduces a high-quality rich annotated Mandarin conversational (RAMC) speech dataset called MagicData-RAMC.
The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz.
- Score: 51.75617364782418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a high-quality rich annotated Mandarin conversational
(RAMC) speech dataset called MagicData-RAMC. The MagicData-RAMC corpus contains
180 hours of conversational speech data recorded from native speakers of
Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs
in MagicData-RAMC are classified into 15 diverse domains and tagged with topic
labels, ranging from science and technology to ordinary life. Accurate
transcriptions and precise speaker voice activity timestamps are manually
labeled for each sample, and detailed speaker information is also provided. As a
Mandarin speech dataset designed for dialog scenarios with high quality and
rich annotations, MagicData-RAMC enriches the data diversity in the Mandarin
speech community and allows extensive research on a series of speech-related
tasks, including automatic speech recognition, speaker diarization, topic
detection, keyword search, and text-to-speech. We also conduct experiments on
several of these tasks and report results to help evaluate the dataset.
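The annotation scheme described above (per-utterance transcripts, speaker voice activity timestamps, topic labels, and speaker metadata) suggests a simple per-dialog data model. The sketch below is a hypothetical loader, not the official MagicData-RAMC tooling; the tab-separated field layout, the file name example_dialog.txt, and the Utterance class are assumptions made for illustration only.

```python
# Minimal sketch (assumed format, not the official MagicData-RAMC loader):
# each annotation line is taken to hold an utterance's start/end times in
# seconds, a speaker label, and the Mandarin transcript, tab-separated.
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Utterance:
    start: float   # voice-activity start time (seconds)
    end: float     # voice-activity end time (seconds)
    speaker: str   # speaker label within the dialog
    text: str      # manually labeled Mandarin transcript


def load_dialog(path: Path) -> List[Utterance]:
    """Read one dialog's annotation file into a list of utterances."""
    utterances = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        start, end, speaker, text = line.split("\t", maxsplit=3)
        utterances.append(Utterance(float(start), float(end), speaker, text))
    return utterances


if __name__ == "__main__":
    dialog = load_dialog(Path("example_dialog.txt"))  # hypothetical file name
    total_speech = sum(u.end - u.start for u in dialog)
    print(f"{len(dialog)} utterances, {total_speech:.1f} s of labeled speech")
```

Such a per-utterance view maps directly onto the tasks listed in the abstract: the (start, end, speaker) triples feed speaker diarization scoring, while the transcripts serve ASR, keyword search, and topic detection.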
Related papers
- CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition [17.806195208457428]
Code-switching is the alternation between two or more languages within a single conversation.
Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions.
This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers.
arXiv Detail & Related papers (2025-02-26T07:59:55Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Advancing Speech Translation: A Corpus of Mandarin-English Conversational Telephone Speech [4.924682400857061]
This paper introduces a set of English translations for a 123-hour subset of the CallHome Mandarin Chinese data and the HKUST Mandarin Telephone Speech data.
We demonstrate that fine-tuning a general-purpose translation model to our Mandarin-English conversational telephone speech training set improves target-domain BLEU by more than 8 points.
arXiv Detail & Related papers (2024-03-25T21:08:06Z)
- MD3: The Multi-Dialect Dataset of Dialogues [20.144004030947507]
We introduce a new dataset of conversational speech representing English from India, Nigeria, and the United States.
The dataset includes more than 20 hours of audio and more than 200,000 orthographically transcribed tokens.
arXiv Detail & Related papers (2023-05-19T00:14:10Z)
- The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines [63.86406909879314]
This paper describes the Conversational Short-phrase Speaker Diarization (CSSD) task.
It consists of training and testing datasets, an evaluation metric, and baselines.
For evaluation, we design a new conversational DER (CDER) metric, which measures speaker diarization accuracy at the utterance level.
arXiv Detail & Related papers (2022-08-17T03:26:23Z)
- MEG-MASC: a high-quality magneto-encephalography dataset for evaluating natural speech processing [1.345669927504424]
The "MEG-MASC" dataset provides a curated set of raw magnetoencephalography (MEG) recordings of 27 English speakers.
We time-stamp the onset and offset of each word and phoneme in the metadata of the recording, and organize the dataset according to the 'Brain Imaging Data Structure' (BIDS).
This data collection provides a suitable benchmark for large-scale encoding and decoding analyses of temporally resolved brain responses to speech.
arXiv Detail & Related papers (2022-07-26T19:17:01Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation [66.99734491847076]
We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs.
Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
arXiv Detail & Related papers (2020-04-08T16:25:39Z)