Real-World En Call Center Transcripts Dataset with PII Redaction
- URL: http://arxiv.org/abs/2507.02958v1
- Date: Mon, 30 Jun 2025 03:41:02 GMT
- Title: Real-World En Call Center Transcripts Dataset with PII Redaction
- Authors: Ha Dao, Gaurav Chawla, Raghu Banda, Caleb DeLeeuw
- Abstract summary: CallCenterEN is a large-scale (91,706 conversations, corresponding to 10448 audio hours) real-world English call center transcript dataset. This is the largest release to-date of open source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines and the United States.
- Score: 0.8077903172320928
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce CallCenterEN, a large-scale (91,706 conversations, corresponding to 10448 audio hours), real-world English call center transcript dataset designed to support research and development in customer support and sales AI systems. This is the largest release to-date of open source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines and the United States. The dataset includes high-quality, PII-redacted human-readable transcriptions. All personally identifiable information (PII) has been rigorously removed to ensure compliance with global data protection laws. The audio is not included in the public release due to biometric privacy concerns. Given the scarcity of publicly available real-world call center datasets, CallCenterEN fills a critical gap in the landscape of available ASR corpora, and is released under a CC BY-NC 4.0 license for non-commercial research use.
Related papers
- WAXAL: A Large-Scale Multilingual African Language Speech Corpus [12.433885475371035]
WAXAL is a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts.
arXiv Detail & Related papers (2026-02-02T19:49:19Z)
- How Sovereign Is Sovereign Compute? A Review of 775 Non-U.S. Data Centers [0.0]
This paper estimates how often data centers could be subject to foreign legal authorities due to the nationality of the data center operators. We find that U.S. companies operate 48% of all non-U.S. data center projects in our dataset when weighted by investment value.
arXiv Detail & Related papers (2025-07-30T22:58:42Z)
- IndieFake Dataset: A Benchmark Dataset for Audio Deepfake Detection [0.4451479907610763]
Deepfake technology offers benefits like AI assistants, better accessibility for speech impairments, and enhanced entertainment. It also poses significant risks to security, privacy, and trust in digital communications. Existing datasets lack diverse ethnic accents, making them inadequate for many real-world scenarios. This work introduces the IndieFake dataset (IFD), featuring 27.17 hours of bonafide and deepfake audio from 50 English speaking Indian speakers.
arXiv Detail & Related papers (2025-06-23T18:10:06Z)
- Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN [0.0]
Urdu is the 10th most widely spoken language in the world, with 231,295,440 speakers worldwide, yet it remains a resource-constrained language in ASR.
This paper describes an implementation framework for a resource-efficient Automatic Speech Recognition/Speech-to-Text system in a noisy call-center environment.
arXiv Detail & Related papers (2023-07-24T13:04:21Z)
- Does Collaborative Human-LM Dialogue Generation Help Information Extraction from Human Dialogues? [55.28340832822234]
Problem-solving human dialogues in real applications can be much more complex than existing Wizard-of-Oz collections.
We introduce a human-in-the-loop dialogue generation framework capable of synthesizing realistic dialogues.
arXiv Detail & Related papers (2023-07-13T20:02:50Z)
- TGDataset: Collecting and Exploring the Largest Telegram Channels Dataset [57.2282378772772]
This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages. We analyze the languages spoken within our dataset and the topic covered by English channels. In addition to the raw dataset, we released the scripts we used to analyze the dataset and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.
arXiv Detail & Related papers (2023-03-09T15:42:38Z)
- Information Extraction and Human-Robot Dialogue towards Real-life Tasks: A Baseline Study with the MobileCS Dataset [52.22314870976088]
The SereTOD challenge releases the MobileCS dataset, which consists of real-world dialog transcripts between real users and customer-service staff from China Mobile.
Based on the MobileCS dataset, the SereTOD challenge has two tasks, not only evaluating the construction of the dialogue system itself, but also examining information extraction from dialog transcripts.
This paper mainly presents a baseline study of the two tasks with the MobileCS dataset.
arXiv Detail & Related papers (2022-09-27T15:30:43Z)
- Developing a Production System for Purpose of Call Detection in Business Phone Conversations [1.4450257955652834]
We describe our implementation of a commercial system to detect Purpose of Call statements in English business call transcripts in real time.
We present a detailed analysis of types of Purpose of Call statements and the language patterns related to them, and discuss an approach to collecting rich training data by bootstrapping from a set of rules to a neural model.
The model achieved 88.6 F1 on average in various types of business calls when tested on real life data and has low inference time.
arXiv Detail & Related papers (2022-05-13T21:45:54Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage [1.5213617014998604]
We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set.
We discuss the legal and ethical issues surrounding the creation of sizable machine learning corpora.
arXiv Detail & Related papers (2021-11-17T19:14:40Z)
- Conversations with Search Engines: SERP-based Conversational Response Generation [77.1381159789032]
We create a suitable dataset, the Search as a Conversation (SaaC) dataset, for the development of pipelines for conversations with search engines.
We also develop a state-of-the-art pipeline for conversations with search engines, the Conversations with Search Engines (CaSE) using this dataset.
CaSE enhances the state of the art by introducing a supporting token identification module and a prior-aware pointer generator.
arXiv Detail & Related papers (2020-04-29T13:07:53Z)
- ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers [23.076908473357577]
We introduce a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario from more than 11,000 people.
ClovaCall includes approximately 60,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain.
We validate the effectiveness of our dataset with intensive experiments using two standard ASR models.
arXiv Detail & Related papers (2020-04-20T15:12:29Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.