From Data Creator to Data Reuser: Distance Matters
- URL: http://arxiv.org/abs/2402.07926v1
- Date: Mon, 5 Feb 2024 18:16:04 GMT
- Title: From Data Creator to Data Reuser: Distance Matters
- Authors: Christine L. Borgman, Paul T. Groth
- Abstract summary: Investment in data management could be made more wisely by considering who might reuse data, how, why, and when.
Data creators cannot anticipate all possible reuses or reusers.
We develop the theoretical construct of distance between data creator and data reuser.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sharing research data is complex, labor-intensive, expensive, and requires
infrastructure investments by multiple stakeholders. Open science policies
focus on data release rather than on data reuse, yet reuse is also difficult,
expensive, and may never occur. Investments in data management could be made
more wisely by considering who might reuse data, how, why, for what purposes,
and when. Data creators cannot anticipate all possible reuses or reusers; our
goal is to identify factors that may aid stakeholders in deciding how to invest
in research data, how to identify potential reuses and reusers, and how to
improve data exchange processes. Drawing upon empirical studies of data sharing
and reuse, we develop the theoretical construct of distance between data
creator and data reuser, identifying six distance dimensions that influence the
ability to transfer knowledge effectively: domain, methods, collaboration,
curation, purposes, and time and temporality. These dimensions are primarily
social in character, with associated technical aspects that can decrease - or
increase - distances between creators and reusers. We identify the order of
expected influence on data reuse and ways in which the six dimensions are
interdependent. Our theoretical framing of the distance between data creators
and prospective reusers leads to recommendations to four categories of
stakeholders on how to make data sharing and reuse more effective: data
creators, data reusers, data archivists, and funding agencies.
Related papers
- Insights from an experiment crowdsourcing data from thousands of US Amazon users: The importance of transparency, money, and data use [6.794366017852433]
This paper shares an innovative approach to crowdsourcing user data to collect otherwise inaccessible Amazon purchase histories, spanning 5 years, from more than 5000 US users.
We developed a data collection tool that prioritizes participant consent and includes an experimental study design.
Experiment results (N=6325) reveal both monetary incentives and transparency can significantly increase data sharing.
arXiv Detail & Related papers (2024-04-19T20:45:19Z) - Efficient Data Collection for Robotic Manipulation via Compositional Generalization [70.76782930312746]
We show that policies can compose environmental factors from their data to succeed when encountering unseen factor combinations.
We propose better in-domain data collection strategies that exploit composition.
We provide videos at http://iliad.stanford.edu/robot-data-comp/.
arXiv Detail & Related papers (2024-03-08T07:15:38Z) - A Survey on Data Selection for Language Models [151.6210632830082]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Ontologies for increasing the FAIRness of plant research data [0.0]
Ontologies provide concepts for a particular domain as well as relationships between those concepts.
By tagging data with ontology terms, data becomes both human- and machine-interpretable, allowing increased reuse and interoperability.
We outline the ontologies most relevant to the fundamental plant sciences and how they can be used to annotate data related to plant-specific experiments.
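The idea of tagging data with ontology terms can be illustrated with a minimal sketch. The record, field names, and `annotate` helper below are hypothetical; the Plant Ontology (PO) is a real ontology, and PO:0025034 is its term for "leaf", but the overall annotation scheme is an illustrative assumption, not the paper's actual method.

```python
# Hypothetical sketch: attach ontology terms to free-text fields of a
# plant-experiment record so the data is both human- and machine-readable.

record = {
    "sample": "arabidopsis_leaf_03",
    "tissue": "leaf",  # free text, ambiguous to machines
}

# Hand-curated mapping from free-text values to ontology term IDs.
annotations = {
    "tissue": {"label": "leaf", "term": "PO:0025034"},  # Plant Ontology: leaf
}

def annotate(record, annotations):
    """Return a copy of the record with ontology terms attached to fields."""
    out = dict(record)
    for field, ann in annotations.items():
        out[field] = ann  # replace free text with a labelled, resolvable term
    return out

annotated = annotate(record, annotations)
print(annotated["tissue"]["term"])  # PO:0025034
```

Because the term ID resolves to a shared, versioned definition, a downstream reuser (or machine) can interpret "leaf" consistently across datasets, which is the interoperability gain the abstract describes.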
arXiv Detail & Related papers (2023-08-25T13:08:26Z) - Privacy-Preserving Graph Machine Learning from Data to Computation: A
Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z) - The Dimensions of Data Labor: A Road Map for Researchers, Activists, and
Policymakers to Empower Data Producers [14.392208044851976]
Data producers have little say in what data is captured, how it is used, or who it benefits.
Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape.
By synthesizing related literature that reconceptualizes the production of data for computing as "data labor", we outline opportunities for researchers, policymakers, and activists to empower data producers.
arXiv Detail & Related papers (2023-05-22T17:11:22Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic
Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Towards Avoiding the Data Mess: Industry Insights from Data Mesh Implementations [1.5029560229270191]
Data mesh is a socio-technical, decentralized, distributed concept for enterprise data management.
We conduct 15 semi-structured interviews with industry experts.
Our findings synthesize insights from industry experts and provide researchers and professionals with preliminary guidelines for the successful adoption of data mesh.
arXiv Detail & Related papers (2023-02-03T13:09:57Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - From Data to Knowledge to Action: A Global Enabler for the 21st Century [26.32590947516587]
A confluence of advances in the computer and mathematical sciences has unleashed unprecedented capabilities for enabling true evidence-based decision making.
These capabilities are making possible the large-scale capture of data and the transformation of that data into insights and recommendations.
The shift of commerce, science, education, art, and entertainment to the web makes available unprecedented quantities of structured and unstructured databases about human activities.
arXiv Detail & Related papers (2020-07-31T19:19:42Z) - DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.