The Casual Conversations v2 Dataset
- URL: http://arxiv.org/abs/2303.04838v1
- Date: Wed, 8 Mar 2023 19:17:05 GMT
- Title: The Casual Conversations v2 Dataset
- Authors: Bilal Porgali, Vítor Albiero, Jordan Ryda, Cristian Canton Ferrer,
Caner Hazirbas
- Abstract summary: The dataset includes 26,467 videos of 5,567 unique paid participants, with an average of almost 5 videos per person.
The participants agreed for their data to be used in assessing fairness of AI models and provided self-reported age, gender, language/dialect, disability status, physical adornments, physical attributes and geo-location information.
- Score: 6.439761523935614
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a new large consent-driven dataset aimed at assisting
in the evaluation of algorithmic bias and robustness of computer vision and
audio speech models with regard to 11 attributes that are either self-reported or
labeled by trained annotators. The dataset includes 26,467 videos of 5,567
unique paid participants, with an average of almost 5 videos per person,
recorded in Brazil, India, Indonesia, Mexico, Vietnam, the Philippines, and the
USA, representing diverse demographic characteristics. The participants agreed
to have their data used in assessing the fairness of AI models and provided
self-reported age, gender, language/dialect, disability status, physical
adornments, physical attributes, and geo-location information, while trained
annotators labeled apparent skin tone using the Fitzpatrick Skin Type and Monk
Skin Tone scales, as well as voice timbre. Annotators also labeled the
recording setups and provided per-second activity annotations.
Related papers
- Towards measuring fairness in speech recognition: Fair-Speech dataset [14.703638352216132]
This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information.
Our dataset includes approximately 26.5K utterances of recorded speech by 593 people in the United States, who were paid to record and submit audio recordings of themselves saying voice commands.
arXiv Detail & Related papers (2024-08-22T20:55:17Z)
- Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
- Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants [10.227469020901232]
This paper introduces the Sonos Voice Control Bias Assessment dataset.
The dataset comprises 1,038 speakers, 166 hours of audio, 170k audio samples, and 9,040 unique labelled transcripts.
Results show statistically significant differences in performance across age, dialectal region and ethnicity.
arXiv Detail & Related papers (2024-05-14T12:53:32Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of vision-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8,740 hours, consisting of approximately 9.8K technical lectures in the English language along with their transcripts, delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z)
- Slovo: Russian Sign Language Dataset [83.93252084624997]
This paper presents the Russian Sign Language (RSL) video dataset Slovo, produced using crowdsourcing platforms.
The dataset contains 20,000 FullHD recordings, divided into 1,000 classes of isolated RSL gestures performed by 194 signers.
arXiv Detail & Related papers (2023-05-23T21:00:42Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
- Towards measuring fairness in AI: the Casual Conversations dataset [9.246092246471955]
Our dataset is composed of 3,011 subjects and contains over 45,000 videos, with an average of 15 videos per person.
The videos were recorded in multiple U.S. states with a diverse set of adults in various age, gender and apparent skin tone groups.
arXiv Detail & Related papers (2021-04-06T22:48:22Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present Vyaktitv, a novel peer-to-peer Hindi conversation dataset.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features for all the participants, such as income and cultural orientation, among others.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
- The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines [88.47608066382267]
We detail how this large-scale dataset was captured by 32 participants in their native kitchen environments.
Recording took place in 4 countries by participants belonging to 10 different nationalities.
Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes.
arXiv Detail & Related papers (2020-04-29T21:57:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.