Beyond Citations: A Cross-Domain Metric for Dataset Impact and Shareability
- URL: http://arxiv.org/abs/2511.12966v1
- Date: Mon, 17 Nov 2025 04:45:19 GMT
- Title: Beyond Citations: A Cross-Domain Metric for Dataset Impact and Shareability
- Authors: Smitha Muthya Sudheendra, Zhongxing Zhang, Wenwen Cao, Jisu Huh, Jaideep Srivastava,
- Abstract summary: The X-index is a novel author-level metric that quantifies the value of data contributions through a two-step process.<n>We validate our approach against expert ratings, achieving a strong correlation.<n>The X-index encourages sustainable data-sharing practices and gives institutions, funders, and platforms a tangible way to acknowledge the lasting influence of research datasets.
- Score: 2.1689170017681696
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The scientific community increasingly relies on open data sharing, yet existing metrics inadequately capture the true impact of datasets as research outputs. Traditional measures, such as the h-index, focus on publications and citations but fail to account for dataset accessibility, reuse, and cross-disciplinary influence. We propose the X-index, a novel author-level metric that quantifies the value of data contributions through a two-step process: (i) computing a dataset-level value score (V-score) that integrates breadth of reuse, FAIRness, citation impact, and transitive reuse depth, and (ii) aggregating V-scores into an author-level X-index. Using datasets from computational social science, medicine, and crisis communication, we validate our approach against expert ratings, achieving a strong correlation. Our results demonstrate that the X-index provides a transparent, scalable, and low-cost framework for assessing data-sharing practices and incentivizing open science. The X-index encourages sustainable data-sharing practices and gives institutions, funders, and platforms a tangible way to acknowledge the lasting influence of research datasets.
Related papers
- Assessing the impact of Open Research Information Infrastructures using NLP driven full-text Scientometrics: A case study of the LXCat open-access platform [0.3694429692322631]
Open research information (ORI) play a central role in shaping how scientific knowledge is produced, disseminated, validated, and reused across the research lifecycle.<n>We present a full-text, natural language processing (NLP) driven scientometric framework to quantify the impact of ORI infrastructures beyond citation counts.
arXiv Detail & Related papers (2026-02-07T19:15:40Z) - Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts [0.0]
We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers.<n>Our approach combines large-scale citation-context extraction, schema-guided dataset recognition, and provenance-preserving entity resolution.<n>We release our code, evaluation datasets, and results on GitHub.
arXiv Detail & Related papers (2026-01-08T16:46:06Z) - OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.<n>ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z) - SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z) - Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Utilizing Out-Domain Datasets to Enhance Multi-Task Citation Analysis [4.526582372434088]
Citation sentiment analysis suffers from both data scarcity and tremendous costs for dataset annotation.
We explore the impact of out-domain data during training to enhance the model performance.
We propose an end-to-end trainable multi-task model that covers the sentiment and intent analysis.
arXiv Detail & Related papers (2022-02-22T13:33:48Z) - Reliable Evaluations for Natural Language Inference based on a Unified
Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-datasets benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
arXiv Detail & Related papers (2020-10-15T11:50:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.