Related papers: The Landscape of Data Reuse in Interactive Information Retrieval: Motivations, Sources, and Evaluation of Reusability

URL: http://arxiv.org/abs/2411.15430v1
Date: Sat, 23 Nov 2024 03:15:31 GMT
Title: The Landscape of Data Reuse in Interactive Information Retrieval: Motivations, Sources, and Evaluation of Reusability
Authors: Tianji Jiang, Wenqi Li, Jiqun Liu
Abstract summary: This study investigated the data reuse practices of experienced researchers from the area of Interactive Information Retrieval (IIR) studies. We conducted 21 semi-structured in-depth interviews with IIR researchers from varying demographic backgrounds, institutions, and stages of careers on their motivations, experiences, and concerns over data reuse.
Score: 5.257245308437576
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Sharing and reusing research data can effectively reduce redundant efforts in data collection and curation, especially for small labs and research teams conducting human-centered system research, and enhance the replicability of evaluation experiments. Building a sustainable data reuse process and culture relies on frameworks that encompass policies, standards, roles, and responsibilities, all of which must address the diverse needs of data providers, curators, and reusers. To advance the knowledge and accumulate empirical understandings on data reuse, this study investigated the data reuse practices of experienced researchers from the area of Interactive Information Retrieval (IIR) studies, where data reuse has been strongly advocated but still remains a challenge. To enhance the knowledge on data reuse behavior and reusability assessment strategies within IIR community, we conducted 21 semi-structured in-depth interviews with IIR researchers from varying demographic backgrounds, institutions, and stages of careers on their motivations, experiences, and concerns over data reuse. We uncovered the reasons, strategies of reusability assessments, and challenges faced by data reusers within the field of IIR as they attempt to reuse researcher data in their studies. The empirical finding improves our understanding of researchers' motivations for reusing data, their approaches to discovering reusable research data, as well as their concerns and criteria for assessing data reusability, and also enriches the on-going discussions on evaluating user-generated data and research resources and promoting community-level data reuse culture and standards.

Related papers

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset [47.98539809308384]
We analyze the Asta Interaction dataset, a large-scale resource comprising over 200,000 user queries and interaction logs. We characterize query patterns, engagement behaviors, and how usage evolves with experience. We release the anonymized dataset and analysis with a new query taxonomy to inform future designs of real-world AI research assistants.
arXiv Detail & Related papers (2026-02-26T18:40:28Z)
LISP -- A Rich Interaction Dataset and Loggable Interactive Search Platform [10.637323019551035]
We present a reusable dataset and accompanying infrastructure for studying human search behavior in Interactive Information Retrieval (IIR). The dataset combines detailed interaction logs from 61 participants with user characteristics, including perceptual speed, topic-specific interest, search expertise, and demographic information.
arXiv Detail & Related papers (2026-01-14T10:49:13Z)
Improving Data Reusability in Interactive Information Retrieval: Insights from the Community [6.651828119227614]
This study aims to expand upon current findings by exploring IIR researchers' information-obtaining behaviors regarding data reuse. We identified the information about shared data characteristics that IIR researchers need when evaluating data reusability.
arXiv Detail & Related papers (2025-12-20T09:12:33Z)
ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research [15.983924435685553]
We develop ScIRGen, a dataset generation framework for scientific QA & retrieval. We use it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers.
arXiv Detail & Related papers (2025-06-09T11:47:13Z)
A Comprehensive Survey on Imbalanced Data Learning [56.65067795190842]
Imbalanced data is prevalent in various types of raw data and hinders the performance of machine learning. This survey systematically analyzes various real-world data formats. It groups existing research for different data formats into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
arXiv Detail & Related papers (2025-02-13T04:53:17Z)
Exploring Retrospective Meeting Practices and the Use of Data in Agile Teams [43.16629507708997]
This study explores barriers to project data utilization, including psychological safety concerns and the disconnect between data collection and meaningful integration of data into retrospective meetings. Our findings confirm that although teams routinely collect project data, they seldom employ it systematically during retrospectives.
arXiv Detail & Related papers (2025-02-05T19:33:53Z)
The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track [1.5993707490601146]
This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit. Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management.
arXiv Detail & Related papers (2024-10-29T19:07:50Z)
Synthetic Data Generation with Large Language Models for Personalized Community Question Answering [47.300506002171275]
We build Sy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Our findings suggest that LLMs have high potential in generating data tailored to users' needs. The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.
arXiv Detail & Related papers (2024-10-29T16:19:08Z)
Reproducibility Needs Reshape Scientific Data Governance [0.0]
Data governance should prioritize maximizing the utility of data throughout the research lifecycle. Proactive analysis and data governance are integral and interconnected components of research lifecycle management.
arXiv Detail & Related papers (2024-09-29T22:13:19Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures [3.348779089844034]
This work simulates the typical tasks of a sustainability analyst by examining 30 sustainability reports with 16 detailed climate-related questions. We obtain a dataset with over 8.5K unique question-source-answer pairs labeled by different levels of relevance. We develop a use case with the dataset to investigate the integration of expert knowledge into information retrieval with embeddings.
arXiv Detail & Related papers (2024-06-14T08:21:42Z)
Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
From Data Creator to Data Reuser: Distance Matters [0.847136673632881]
Open science policies focus more heavily on data sharing than on reuse. The value of data reuse lies in relationships between creators and reusers. We develop the theoretical construct of distance between data creator and data reuser.
arXiv Detail & Related papers (2024-02-05T18:16:04Z)
Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions. To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter. Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature. We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
Releasing survey microdata with exact cluster locations and additional privacy safeguards [77.34726150561087]
We propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards. Our strategy reduces the respondents' re-identification risk for any number of disclosed attributes by 60-80% even under re-identification attempts.
arXiv Detail & Related papers (2022-05-24T19:37:11Z)
Subdivisions and Crossroads: Identifying Hidden Community Structures in a Data Archive's Citation Network [1.6631602844999724]
This paper analyzes the community structure of an authoritative network of datasets cited in academic publications. We identify communities of social science datasets and fields of research connected through shared data use. Our research reveals the hidden structure of data reuse and demonstrates how interdisciplinary research communities organize around datasets as shared scientific inputs.
arXiv Detail & Related papers (2022-05-17T14:18:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.