Don't mention it: An approach to assess challenges to using software
mentions for citation and discoverability research
- URL: http://arxiv.org/abs/2402.14602v1
- Date: Thu, 22 Feb 2024 14:51:17 GMT
- Title: Don't mention it: An approach to assess challenges to using software
mentions for citation and discoverability research
- Authors: Stephan Druskat, Neil P. Chue Hong, Sammie Buzzard, Olexandr
Konovalov, Patrick Kornek
- Abstract summary: We present an approach to assess the usability of such datasets for research on research software.
While one dataset does not provide links to mentioned software at all, the other does so in a way that can impede quantitative research endeavors.
The greatest challenge and underlying issue in working with software mention datasets is the still suboptimal practice of software citation.
- Score: 0.3268055538225029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Datasets collecting software mentions from scholarly publications can
potentially be used for research into the software that has been used in the
published research, as well as into the practice of software citation.
Recently, new software mention datasets with different characteristics have
been published. We present an approach to assess the usability of such datasets
for research on research software. Our approach includes sampling and data
preparation, manual annotation for quality and mention characteristics, and
annotation analysis. We applied it to two software mention datasets for
evaluation, based on qualitative observation. In doing so, we identified
challenges to using the selected datasets for research. The main issues concern
dataset structure, the quality of the extracted mentions
(54% and 23% of mentions, respectively, do not refer to software), and software
accessibility. While one dataset does not provide links to mentioned software
at all, the other does so in a way that can impede quantitative research
endeavors: (1) Links may come from different sources and each point to
different software for the same mention. (2) The quality of the automatically
retrieved links is generally poor (in our sample, 65.4% link to the wrong
software). (3) Links exist only for a small subset (in our sample, 20.5%) of
mentions, which may lead to skewed or disproportionate samples. However, the
greatest challenge and underlying issue in working with software mention
datasets is the still suboptimal practice of software citation: Software should
not be mentioned; it should be cited following the software citation
principles.
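To make the annotation-analysis step more concrete, the sketch below shows how the proportions quoted in the abstract (mentions that do not refer to software, mentions with links, and links pointing to the wrong software) could be computed from a manually annotated sample. This is a minimal illustration only: the record fields and the summarise helper are hypothetical and do not reflect the schema of either assessed dataset or the authors' actual tooling.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedMention:
    """One manually annotated software mention (illustrative fields only)."""
    text: str                            # the extracted mention string
    is_software: bool                    # annotator judgement: does it refer to software?
    link: Optional[str] = None           # automatically retrieved link, if any
    link_correct: Optional[bool] = None  # annotator judgement on the link target

def summarise(sample: list[AnnotatedMention]) -> dict[str, float]:
    """Compute per-sample proportions of the kind reported in the abstract."""
    n = len(sample)
    linked = [m for m in sample if m.link is not None]
    return {
        # e.g. 0.54 and 0.23 in the two assessed samples
        "not_software": sum(not m.is_software for m in sample) / n,
        # e.g. 0.205: only a small subset of mentions has links at all
        "with_link": len(linked) / n,
        # e.g. 0.654: most automatically retrieved links point to the wrong software
        "wrong_link": (sum(m.link_correct is False for m in linked) / len(linked)
                       if linked else 0.0),
    }
```

Reporting link coverage ("with_link") alongside link correctness matters because, as the abstract notes, an analysis restricted to linked mentions works with a skewed or disproportionate subsample.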
Related papers
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z)
- Towards a Quality Indicator for Research Data publications and Research Software publications -- A vision from the Helmholtz Association [0.24848203755267903]
There is not yet an established process to assess and evaluate the quality of research data and research software publications.
The Task Group Quality Indicators for Data and Software Publications is currently developing a quality indicator for research data and research software publications.
arXiv Detail & Related papers (2024-01-16T20:00:27Z)
- TRIAD: Automated Traceability Recovery based on Biterm-enhanced Deduction of Transitive Links among Artifacts [53.92293118080274]
Traceability allows stakeholders to extract and comprehend the trace links among software artifacts introduced across the software life cycle.
Most automated traceability recovery approaches rely on textual similarities among software artifacts, such as those based on Information Retrieval (IR).
arXiv Detail & Related papers (2023-12-28T06:44:24Z)
- How do software citation formats evolve over time? A longitudinal analysis of R programming language packages [12.082972614614413]
This study compares and analyzes a longitudinal dataset of citation formats of all R packages collected in 2021 and 2022.
We investigate the different document types underlying the citations and what metadata elements in the citation formats changed over time.
arXiv Detail & Related papers (2023-07-17T09:18:57Z)
- Analyzing Dataset Annotation Quality Management in the Wild [63.07224587146207]
Even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts.
While practices and guidelines regarding dataset creation projects exist, large-scale analysis has yet to be performed on how quality management is conducted.
arXiv Detail & Related papers (2023-07-16T21:22:40Z)
- A Metadata-Based Ecosystem to Improve the FAIRness of Research Software [0.3185506103768896]
The reuse of research software is central to research efficiency and academic exchange.
The paper presents the DataDesc ecosystem, an approach to describing data models of software interfaces with detailed and machine-actionable metadata.
arXiv Detail & Related papers (2023-06-18T19:01:08Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Semantically-enhanced Topic Recommendation System for Software Projects [2.0625936401496237]
Tagging software repositories with relevant topics can be exploited for facilitating various downstream tasks.
There have been efforts to recommend topics for software projects; however, the semantic relationships among these topics have not been exploited so far.
We propose two recommender models for tagging software projects that incorporate the semantic relationship among topics.
arXiv Detail & Related papers (2022-05-31T19:54:42Z)
- SoMeSci- A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles [1.335443972283229]
SoMeSci is a knowledge graph of software mentions in scientific articles.
It contains high-quality annotations (IRR: $\kappa=.82$) of 3756 software mentions in 1367 PubMed Central articles.
arXiv Detail & Related papers (2021-08-20T08:53:03Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)