The Biased Journey of MSD_AUDIO.ZIP
- URL: http://arxiv.org/abs/2308.16389v3
- Date: Sat, 2 Dec 2023 02:01:49 GMT
- Title: The Biased Journey of MSD_AUDIO.ZIP
- Authors: Haven Kim, Keunwoo Choi, Mateusz Modrzejewski, Cynthia C. S. Liem
- Abstract summary: Access to the audio data corresponding to the Million Song Dataset has become restricted to those within certain affiliations that are connected peer-to-peer.
We draw insights from the experiences of 22 individuals who either attempted to access the data or played a role in its creation.
- Score: 5.695436409400152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The equitable distribution of academic data is crucial for ensuring equal
research opportunities, and ultimately further progress. Yet, due to the
complexity of using the API for audio data that corresponds to the Million Song
Dataset along with its misreporting (before 2016) and the discontinuation of
this API (after 2016), access to this data has become restricted to those
within certain affiliations that are connected peer-to-peer. In this paper, we
delve into this issue, drawing insights from the experiences of 22 individuals
who either attempted to access the data or played a role in its creation. With
this, we hope to initiate more critical dialogue and more thoughtful
consideration with regard to access privilege in the MIR community.
Related papers
- Post-Post-API Age: Studying Digital Platforms in Scant Data Access Times [5.997153455641738]
The "post-API age" has sparked optimism about increased platform transparency and renewed opportunities for comprehensive research on digital platforms.
However, it remains unclear whether platforms provide adequate data access in practice.
Our findings reveal significant challenges in accessing social media data.
These challenges have exacerbated existing institutional, regional, and financial inequities in data access.
arXiv Detail & Related papers (2025-05-15T00:47:06Z)
- Multi-Platform Aggregated Dataset of Online Communities (MADOC) [64.45797970830233]
MADOC aggregates and standardizes data from Bluesky, Koo, Reddit, and Voat (2012-2024), containing 18.9 million posts, 236 million comments, and 23.1 million unique users.
The dataset enables comparative studies of toxic behavior evolution across platforms through standardized interaction records and sentiment analysis.
arXiv Detail & Related papers (2025-01-22T14:02:11Z)
- How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages [0.0]
The European Union's General Data Protection Regulation strengthened data subjects' right to access personal data.
Subjects' possibilities for actually using controller-provided subject access request packages (SARPs) are severely limited so far.
This dataset is publicly provided and shall, in the future, serve as a starting point for researching and comparing novel approaches for practically viable use of SARPs.
arXiv Detail & Related papers (2024-07-05T12:39:51Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
- Contributing to Accessibility Datasets: Reflections on Sharing Study Data by Blind People [14.625384963263327]
We present a pair of studies where 13 blind participants engage in data capturing activities.
We see how different factors influence blind participants' willingness to share study data as they assess risk-benefit tradeoffs.
The majority support sharing of their data to improve technology but also express concerns over commercial use, associated metadata, and the lack of transparency about the impact of their data.
arXiv Detail & Related papers (2023-03-09T00:42:18Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Yes-Yes-Yes: Donation-based Peer Reviewing Data Collection for ACL Rolling Review and Beyond [58.71736531356398]
We present an in-depth discussion of peer reviewing data, outline the ethical and legal desiderata for peer reviewing data collection, and propose the first continuous, donation-based data collection workflow.
We report on the ongoing implementation of this workflow at the ACL Rolling Review and deliver the first insights obtained with the newly collected data.
arXiv Detail & Related papers (2022-01-27T11:02:43Z)
- Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions.
We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity.
Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z)
- Digital trace data collection through data donation [0.4499833362998487]
Article 15 of the EU's 2018 General Data Protection Regulation mandates that individuals have electronic access to their personal data.
All major digital platforms now comply with this law by providing users with "data download packages" (DDPs).
Through DDPs, the data that public and private entities collect over the course of citizens' digital lives can be obtained and analyzed to answer social-scientific questions.
We provide a blueprint for digital trace data collection using DDPs, and devise a "total error framework" for such projects.
arXiv Detail & Related papers (2020-11-13T11:19:25Z)