The Landscape of Data Reuse in Interactive Information Retrieval: Motivations, Sources, and Evaluation of Reusability
- URL: http://arxiv.org/abs/2411.15430v1
- Date: Sat, 23 Nov 2024 03:15:31 GMT
- Title: The Landscape of Data Reuse in Interactive Information Retrieval: Motivations, Sources, and Evaluation of Reusability
- Authors: Tianji Jiang, Wenqi Li, Jiqun Liu
- Abstract summary: This study investigated the data reuse practices of experienced researchers from the area of Interactive Information Retrieval (IIR) studies.
We conducted 21 semi-structured in-depth interviews with IIR researchers from varying demographic backgrounds, institutions, and stages of careers on their motivations, experiences, and concerns over data reuse.
- Score: 5.257245308437576
- Abstract: Sharing and reusing research data can effectively reduce redundant efforts in data collection and curation, especially for small labs and research teams conducting human-centered system research, and enhance the replicability of evaluation experiments. Building a sustainable data reuse process and culture relies on frameworks that encompass policies, standards, roles, and responsibilities, all of which must address the diverse needs of data providers, curators, and reusers. To advance the knowledge and accumulate empirical understandings on data reuse, this study investigated the data reuse practices of experienced researchers from the area of Interactive Information Retrieval (IIR) studies, where data reuse has been strongly advocated but still remains a challenge. To enhance the knowledge on data reuse behavior and reusability assessment strategies within the IIR community, we conducted 21 semi-structured in-depth interviews with IIR researchers from varying demographic backgrounds, institutions, and stages of careers on their motivations, experiences, and concerns over data reuse. We uncovered the reasons, strategies of reusability assessments, and challenges faced by data reusers within the field of IIR as they attempt to reuse researcher data in their studies. The empirical findings improve our understanding of researchers' motivations for reusing data, their approaches to discovering reusable research data, as well as their concerns and criteria for assessing data reusability, and also enrich the ongoing discussions on evaluating user-generated data and research resources and promoting community-level data reuse culture and standards.
Related papers
- The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track [1.5993707490601146]
This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation.
We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit.
Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management.
(2024-10-29T19:07:50Z)
- Synthetic Data Generation with Large Language Models for Personalized Community Question Answering [47.300506002171275]
We build Sy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities.
Our findings suggest that LLMs have high potential in generating data tailored to users' needs.
The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.
(2024-10-29T16:19:08Z)
- Reproducibility Needs Reshape Scientific Data Governance [0.0]
Data governance should prioritize maximizing the utility of data throughout the research lifecycle.
Proactive analysis and data governance are integral and interconnected components of research lifecycle management.
(2024-09-29T22:13:19Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
(2024-06-20T16:34:07Z)
- ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures [3.348779089844034]
This work simulates the typical tasks of a sustainability analyst by examining 30 sustainability reports with 16 detailed climate-related questions.
We obtain a dataset with over 8.5K unique question-source-answer pairs labeled by different levels of relevance.
We develop a use case with the dataset to investigate the integration of expert knowledge into information retrieval with embeddings.
(2024-06-14T08:21:42Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
(2024-04-26T09:51:24Z)
- From Data Creator to Data Reuser: Distance Matters [0.847136673632881]
Open science policies focus more heavily on data sharing than on reuse.
The value of data reuse lies in relationships between creators and reusers.
We develop the theoretical construct of distance between data creator and data reuser.
(2024-02-05T18:16:04Z)
- Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
(2023-03-18T19:17:47Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
(2022-07-18T11:38:32Z)
- Releasing survey microdata with exact cluster locations and additional privacy safeguards [77.34726150561087]
We propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards.
Our strategy reduces the respondents' re-identification risk for any number of disclosed attributes by 60-80% even under re-identification attempts.
(2022-05-24T19:37:11Z)
- Subdivisions and Crossroads: Identifying Hidden Community Structures in a Data Archive's Citation Network [1.6631602844999724]
This paper analyzes the community structure of an authoritative network of datasets cited in academic publications.
We identify communities of social science datasets and fields of research connected through shared data use.
Our research reveals the hidden structure of data reuse and demonstrates how interdisciplinary research communities organize around datasets as shared scientific inputs.
(2022-05-17T14:18:49Z)