From Data Creator to Data Reuser: Distance Matters
- URL: http://arxiv.org/abs/2402.07926v1
- Date: Mon, 5 Feb 2024 18:16:04 GMT
- Title: From Data Creator to Data Reuser: Distance Matters
- Authors: Christine L. Borgman, Paul T. Groth
- Abstract summary: Investment in data management could be made more wisely by considering who might reuse data, how, why, and when.
Data creators cannot anticipate all possible reuses or reusers.
We develop the theoretical construct of distance between data creator and data reuser.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sharing research data is complex, labor-intensive, expensive, and requires
infrastructure investments by multiple stakeholders. Open science policies
focus on data release rather than on data reuse, yet reuse is also difficult,
expensive, and may never occur. Investments in data management could be made
more wisely by considering who might reuse data, how, why, for what purposes,
and when. Data creators cannot anticipate all possible reuses or reusers; our
goal is to identify factors that may aid stakeholders in deciding how to invest
in research data, how to identify potential reuses and reusers, and how to
improve data exchange processes. Drawing upon empirical studies of data sharing
and reuse, we develop the theoretical construct of distance between data
creator and data reuser, identifying six distance dimensions that influence the
ability to transfer knowledge effectively: domain, methods, collaboration,
curation, purposes, and time and temporality. These dimensions are primarily
social in character, with associated technical aspects that can decrease - or
increase - distances between creators and reusers. We identify the order of
expected influence on data reuse and ways in which the six dimensions are
interdependent. Our theoretical framing of the distance between data creators
and prospective reusers leads to recommendations to four categories of
stakeholders on how to make data sharing and reuse more effective: data
creators, data reusers, data archivists, and funding agencies.
Related papers
- Insights from an experiment crowdsourcing data from thousands of US Amazon users: The importance of transparency, money, and data use [6.794366017852433]
This paper shares an innovative approach to crowdsourcing user data to collect otherwise inaccessible Amazon purchase histories, spanning 5 years, from more than 5000 US users.
We developed a data collection tool that prioritizes participant consent and includes an experimental study design.
Experiment results (N=6325) reveal both monetary incentives and transparency can significantly increase data sharing.
arXiv Detail & Related papers (2024-04-19T20:45:19Z) - Efficient Data Collection for Robotic Manipulation via Compositional Generalization [70.76782930312746]
We show that policies can compose environmental factors from their data to succeed when encountering unseen factor combinations.
We propose better in-domain data collection strategies that exploit composition.
We provide videos at http://iliad.stanford.edu/robot-data-comp/.
arXiv Detail & Related papers (2024-03-08T07:15:38Z) - A Survey on Data Selection for Language Models [151.6210632830082]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Ontologies for increasing the FAIRness of plant research data [0.0]
Ontologies provide concepts for a particular domain as well as relationships between those concepts.
By tagging data with ontology terms, data becomes both human- and machine-interpretable, allowing increased reuse and interoperability.
We outline the ontologies most relevant to the fundamental plant sciences and how they can be used to annotate data related to plant-specific experiments.
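The idea of tagging data with ontology terms can be illustrated with a minimal sketch. The record, field names, and `annotate` helper below are hypothetical; the Plant Ontology (PO) is a real ontology, and PO:0025034 is its term for "leaf", but the overall annotation scheme is an illustrative assumption, not the paper's actual method.

```python
# Hypothetical sketch: attach ontology terms to free-text fields of a
# plant-experiment record so the data is both human- and machine-readable.

record = {
    "sample": "arabidopsis_leaf_03",
    "tissue": "leaf",  # free text, ambiguous to machines
}

# Hand-curated mapping from free-text values to ontology term IDs.
annotations = {
    "tissue": {"label": "leaf", "term": "PO:0025034"},  # Plant Ontology: leaf
}

def annotate(record, annotations):
    """Return a copy of the record with ontology terms attached to fields."""
    out = dict(record)
    for field, ann in annotations.items():
        out[field] = ann  # replace free text with a labelled, resolvable term
    return out

annotated = annotate(record, annotations)
print(annotated["tissue"]["term"])  # PO:0025034
```

Because the term ID resolves to a shared, versioned definition, a downstream reuser (or machine) can interpret "leaf" consistently across datasets, which is the interoperability gain the abstract describes.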
arXiv Detail & Related papers (2023-08-25T13:08:26Z) - Privacy-Preserving Graph Machine Learning from Data to Computation: A
Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z) - The Dimensions of Data Labor: A Road Map for Researchers, Activists, and
Policymakers to Empower Data Producers [14.392208044851976]
Data producers have little say in what data is captured, how it is used, or who it benefits.
Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape.
By synthesizing related literature that reconceptualizes the production of data for computing as "data labor", we outline opportunities for researchers, policymakers, and activists to empower data producers.
arXiv Detail & Related papers (2023-05-22T17:11:22Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic
Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Towards Avoiding the Data Mess: Industry Insights from Data Mesh Implementations [1.5029560229270191]
Data mesh is a socio-technical, decentralized, distributed concept for enterprise data management.
We conduct 15 semi-structured interviews with industry experts.
Our findings synthesize insights from industry experts and provide researchers and professionals with preliminary guidelines for the successful adoption of data mesh.
arXiv Detail & Related papers (2023-02-03T13:09:57Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - From Data to Knowledge to Action: A Global Enabler for the 21st Century [26.32590947516587]
A confluence of advances in the computer and mathematical sciences has unleashed unprecedented capabilities for enabling true evidence-based decision making.
These capabilities are making possible the large-scale capture of data and the transformation of that data into insights and recommendations.
The shift of commerce, science, education, art, and entertainment to the web makes available unprecedented quantities of structured and unstructured databases about human activities.
arXiv Detail & Related papers (2020-07-31T19:19:42Z) - DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.