From Data Creator to Data Reuser: Distance Matters
- URL: http://arxiv.org/abs/2402.07926v3
- Date: Fri, 07 Feb 2025 17:34:37 GMT
- Title: From Data Creator to Data Reuser: Distance Matters
- Authors: Christine L. Borgman, Paul T. Groth,
- Abstract summary: Open science policies focus more heavily on data sharing than on reuse.
Both are complex, labor-intensive, expensive, and require infrastructure investments by multiple stakeholders.
Value of data reuse lies in relationships between creators and reusers.
- Score: 0.847136673632881
- License:
- Abstract: Sharing research data is necessary, but not sufficient, for data reuse. Open science policies focus more heavily on data sharing than on reuse, yet both are complex, labor-intensive, expensive, and require infrastructure investments by multiple stakeholders. The value of data reuse lies in relationships between creators and reusers. By addressing knowledge exchange, rather than mere transactions between stakeholders, investments in data management and knowledge infrastructures can be made more wisely. Drawing upon empirical studies of data sharing and reuse, we develop the metaphor of distance between data creator and data reuser, identifying six dimensions of distance that influence the ability to transfer knowledge effectively: domain, methods, collaboration, curation, purposes, and time and temporality. We explore how social and socio-technical aspects of these dimensions may decrease -- or increase -- distances to be traversed between creators and reusers. Our theoretical framing of the distance between data creators and prospective reusers leads to recommendations to four categories of stakeholders on how to make data sharing and reuse more effective: data creators, data reusers, data archivists, and funding agencies. 'It takes a village' to share research data -- and a village to reuse data. Our aim is to provoke new research questions, new research, and new investments in effective and efficient circulation of research data; and to identify criteria for investments at each stage of data and research life cycles.
Related papers
- The Landscape of Data Reuse in Interactive Information Retrieval: Motivations, Sources, and Evaluation of Reusability [5.257245308437576]
This study investigated the data reuse practices of experienced researchers from the area of Interactive Information Retrieval (IIR) studies.
We conducted 21 semi-structured in-depth interviews with IIR researchers from varying demographic backgrounds, institutions, and stages of careers on their motivations, experiences, and concerns over data reuse.
arXiv Detail & Related papers (2024-11-23T03:15:31Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Insights from an experiment crowdsourcing data from thousands of US Amazon users: The importance of transparency, money, and data use [6.794366017852433]
This paper shares an innovative approach to crowdsourcing user data to collect otherwise inaccessible Amazon purchase histories, spanning 5 years, from more than 5000 US users.
We developed a data collection tool that prioritizes participant consent and includes an experimental study design.
Experiment results (N=6325) reveal both monetary incentives and transparency can significantly increase data sharing.
arXiv Detail & Related papers (2024-04-19T20:45:19Z) - Efficient Data Collection for Robotic Manipulation via Compositional Generalization [70.76782930312746]
We show that policies can compose environmental factors from their data to succeed when encountering unseen factor combinations.
We propose better in-domain data collection strategies that exploit composition.
We provide videos at http://iliad.stanford.edu/robot-data-comp/.
arXiv Detail & Related papers (2024-03-08T07:15:38Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Ontologies for increasing the FAIRness of plant research data [0.0]
Onologies provide concepts for a particular domain as well as relationships between concepts.
By tagging with data terms data becomes both human machine interpretable, allowing increased reuse and interoperability.
We outline the most relevant to the fundamental plant sciences and how they can be used to annotate data related to plant-specific experiments.
arXiv Detail & Related papers (2023-08-25T13:08:26Z) - The Dimensions of Data Labor: A Road Map for Researchers, Activists, and
Policymakers to Empower Data Producers [14.392208044851976]
Data producers have little say in what data is captured, how it is used, or who it benefits.
Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape.
By synthesizing related literature that reconceptualizes the production of data for computing as data labor'', we outline opportunities for researchers, policymakers, and activists to empower data producers.
arXiv Detail & Related papers (2023-05-22T17:11:22Z) - Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z) - Subdivisions and Crossroads: Identifying Hidden Community Structures in
a Data Archive's Citation Network [1.6631602844999724]
This paper analyzes the community structure of an authoritative network of datasets cited in academic publications.
We identify communities of social science datasets and fields of research connected through shared data use.
Our research reveals the hidden structure of data reuse and demonstrates how interdisciplinary research communities organize around datasets as shared scientific inputs.
arXiv Detail & Related papers (2022-05-17T14:18:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.