Augmented Datasheets for Speech Datasets and Ethical Decision-Making
- URL: http://arxiv.org/abs/2305.04672v1
- Date: Mon, 8 May 2023 12:49:04 GMT
- Title: Augmented Datasheets for Speech Datasets and Ethical Decision-Making
- Authors: Orestis Papakyriakopoulos, Anna Seo Gyeong Choi, Jerone Andrews,
Rebecca Bourke, William Thong, Dora Zhao, Alice Xiang, Allison Koenecke
- Abstract summary: Speech datasets are crucial for training Speech Language Technologies (SLT).
Lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products.
There is often a lack of oversight on the underlying training data with regard to the ethics of such data collection.
- Score: 2.7106766103546236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech datasets are crucial for training Speech Language Technologies (SLT);
however, the lack of diversity of the underlying training data can lead to
serious limitations in building equitable and robust SLT products, especially
along dimensions of language, accent, dialect, variety, and speech impairment -
and the intersectionality of speech features with socioeconomic and demographic
features. Furthermore, there is often a lack of oversight on the underlying
training data - commonly built on massive web-crawling and/or publicly
available speech - with regard to the ethics of such data collection. To
encourage standardized documentation of such speech data components, we
introduce an augmented datasheet for speech datasets, which can be used in
addition to "Datasheets for Datasets". We then exemplify the importance of each
question in our augmented datasheet based on in-depth literature reviews of
speech data used in domains such as machine learning, linguistics, and health.
Finally, we encourage practitioners - ranging from dataset creators to
researchers - to use our augmented datasheet to better define the scope,
properties, and limits of speech datasets, while also encouraging consideration
of data-subject protection and user community empowerment. Ethical dataset
creation is not a one-size-fits-all process, but dataset creators can use our
augmented datasheet to reflexively consider the social context of related SLT
applications and data sources in order to foster more inclusive SLT products
downstream.
Related papers
- Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z)
- Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research [0.0]
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics.
Access to these datasets is often restricted due to costs and platform regulations.
This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
arXiv Detail & Related papers (2024-07-11T09:12:39Z)
- Speech Emotion Recognition under Resource Constraints with Data Distillation [64.36799373890916]
Speech emotion recognition (SER) plays a crucial role in human-computer interaction.
The emergence of edge devices in the Internet of Things presents challenges in constructing intricate deep learning models.
We propose a data distillation framework to facilitate efficient development of SER models in IoT applications.
arXiv Detail & Related papers (2024-06-21T13:10:46Z)
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- SER_AMPEL: a multi-source dataset for speech emotion recognition of Italian older adults [58.49386651361823]
SER_AMPEL is a multi-source dataset for speech emotion recognition (SER).
It was collected to provide a reference for speech emotion recognition in Italian older adults.
The evidence of the need for such a dataset emerges from the analysis of the state of the art.
arXiv Detail & Related papers (2023-11-24T13:47:25Z)
- Considerations for Ethical Speech Recognition Datasets [0.799536002595393]
We use automatic speech recognition as a case study and examine the properties that ethical speech datasets should possess towards responsible AI applications.
We showcase diversity issues, inclusion practices, and necessary considerations that can improve trained models.
We argue for the legal and privacy protection of data subjects, targeted data sampling corresponding to user demographics and needs, and appropriate metadata that ensures explainability and accountability in cases of model failure.
arXiv Detail & Related papers (2023-05-03T12:38:14Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding.
COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- Multimodal datasets: misogyny, pornography, and malignant stereotypes [2.8682942808330703]
We examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset.
We found that the dataset contains troublesome and explicit image and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
arXiv Detail & Related papers (2021-10-05T11:47:27Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.