Use as Directed? A Comparison of Software Tools Intended to Check Rigor and Transparency of Published Work
- URL: http://arxiv.org/abs/2507.17991v1
- Date: Wed, 23 Jul 2025 23:49:28 GMT
- Title: Use as Directed? A Comparison of Software Tools Intended to Check Rigor and Transparency of Published Work
- Authors: Peter Eckmann, Adrian Barnett, Alexandra Bannach-Brown, Elisa Pilar Bascunan Atria, Guillaume Cabanac, Louise Delwen Owen Franzen, Małgorzata Anna Gazda, Kaitlyn Hair, James Howison, Halil Kilicoglu, Cyril Labbe, Sarah McCann, Vladislav Nachev, Martijn Roelandse, Maia Salholz-Hillel, Robert Schulz, Gerben ter Riet, Colby Vorland, Anita Bandrowski, Tracey Weissgerber
- Abstract summary: Lack of standardization and transparency in scientific reporting is a major problem. Several automated tools have been designed to check different rigor criteria. We have conducted a broad comparison of 11 automated tools across 9 different rigor criteria from the ScreenIT group.
- Score: 28.252424517077557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The causes of the reproducibility crisis include lack of standardization and transparency in scientific reporting. Checklists such as ARRIVE and CONSORT seek to improve transparency, but they are not always followed by authors and peer review often fails to identify missing items. To address these issues, several automated tools have been designed to check different rigor criteria. We have conducted a broad comparison of 11 automated tools across 9 different rigor criteria from the ScreenIT group. For some criteria, including detecting open data, the comparison revealed a clear winner, a tool that performed much better than the others. In other cases, including detection of inclusion and exclusion criteria, the combination of tools exceeded the performance of any one tool. We also identified key areas where tool developers should focus their effort to make their tools maximally useful. We conclude with a set of insights and recommendations for stakeholders in the development of rigor and transparency detection tools. The code and data for the study are available at https://github.com/PeterEckmann1/tool-comparison.
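The study's actual code and data are at the GitHub link above. As a rough illustration of the kind of analysis described, the sketch below scores hypothetical binary tool outputs against manual annotations for a single rigor criterion, both per tool and for a simple union-style combination. All names and values here are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch (not the authors' code): score binary tool flags for one
# rigor criterion (e.g. "open data detected") against manual annotations,
# per tool and for a simple union-style combination of tools.
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical data: 1 = criterion detected/reported, 0 = not.
manual = [1, 0, 1, 1, 0, 1, 0, 0]          # gold-standard annotations
tool_flags = {
    "tool_a": [1, 0, 0, 1, 0, 1, 0, 1],
    "tool_b": [1, 0, 1, 0, 0, 1, 0, 0],
}

def score(pred, gold):
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, average="binary", zero_division=0)
    return {"precision": p, "recall": r, "f1": f1}

for name, flags in tool_flags.items():
    print(name, score(flags, manual))

# Union combination: flag a paper if any tool flags it, which tends to
# raise recall, possibly at the cost of precision.
union = [int(any(col)) for col in zip(*tool_flags.values())]
print("union", score(union, manual))
```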
Related papers
- Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models [43.50789219459378]
We propose Tool Graph Retriever (TGR), which exploits the dependencies among tools to learn better tool representations for retrieval. First, we construct a dataset termed TDI300K to train a discriminator for identifying tool dependencies. Then, we represent all candidate tools as a tool dependency graph and use graph convolution to integrate the dependencies into their representations.
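As a rough sketch of the graph-convolution idea mentioned here (a single normalized graph-convolution step over a tool dependency graph; the shapes, adjacency, and single-layer form are assumptions, not TGR's actual architecture):

```python
# Minimal sketch of one graph-convolution step over a tool dependency graph
# (illustrative only; not the TGR implementation).
import numpy as np

n_tools, dim = 4, 8
X = np.random.randn(n_tools, dim)          # initial tool representations
A = np.array([[0, 1, 0, 0],                # A[i, j] = 1 if tool i depends on tool j
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_hat = A + np.eye(n_tools)                # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization

W = np.random.randn(dim, dim) * 0.1        # learnable weights in practice
H = np.maximum(A_norm @ X @ W, 0)          # dependency-aware tool embeddings (ReLU)
```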
arXiv Detail & Related papers (2025-08-07T08:36:26Z)
- ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [54.52092001110694]
ThinkGeo is a benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z)
- Vexed by VEX tools: Consistency evaluation of container vulnerability scanners [0.0]
This paper presents a study that analyzed state-of-the-art vulnerability scanning tools applied to containers. We have focused the work on tools following the Vulnerability Exploitability eXchange (VEX) format.
arXiv Detail & Related papers (2025-03-18T16:22:43Z)
- Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z) - On the Limitations of Combining Sentiment Analysis Tools in a Cross-Platform Setting [2.3818760805173342]
We analyze a combination of three sentiment analysis tools in a voting classifier according to their reliability and performance.<n>The results indicate that this kind of combination of tools is a good choice in the within-platform setting.<n>However, a majority vote does not necessarily lead to better results when applying in cross-platform domains.
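A minimal sketch of the kind of majority-vote combination examined here (hypothetical tool outputs and labels; not the study's implementation):

```python
# Minimal sketch of a majority-vote combination of three sentiment tools
# (hypothetical labels; not the study's implementation).
from collections import Counter

def majority_vote(labels):
    """Return the most common label among the tools' predictions."""
    return Counter(labels).most_common(1)[0][0]

# Per-document predictions from three hypothetical tools.
predictions = [
    ("positive", "positive", "negative"),
    ("neutral",  "negative", "negative"),
]
combined = [majority_vote(p) for p in predictions]
print(combined)  # ['positive', 'negative']
```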
arXiv Detail & Related papers (2025-02-10T16:51:51Z) - Does the Tool Matter? Exploring Some Causes of Threats to Validity in Mining Software Repositories [9.539825294372786]
We use two tools to extract and analyse ten large software projects.<n>Despite similar trends, even simple metrics such as the numbers of commits and developers may differ by up to 500%.<n>We find that such substantial differences are often caused by minor technical details.
arXiv Detail & Related papers (2025-01-25T07:42:56Z)
- Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval [47.81307125613145]
Re-Invoke is an unsupervised tool retrieval method designed to scale effectively to large toolsets without training.
We employ a novel multi-view similarity ranking strategy based on intents to pinpoint the most relevant tools for each query.
Our evaluation demonstrates that Re-Invoke significantly outperforms state-of-the-art alternatives in both single-tool and multi-tool scenarios.
arXiv Detail & Related papers (2024-08-03T22:49:27Z)
- Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark [8.573278807410507]
This paper presents Seal-Tools, a new tool learning dataset.
Seal-Tools contains self-instruct API-like tools.
It also includes instances that demonstrate the practical application of tools.
arXiv Detail & Related papers (2024-05-14T06:50:19Z)
- What Are Tools Anyway? A Survey from the Language Model Perspective [67.18843218893416]
Language models (LMs) are powerful, yet they are used mostly for text generation tasks.
We provide a unified definition of tools as external programs used by LMs.
We empirically study the efficiency of various tooling methods.
arXiv Detail & Related papers (2024-03-18T17:20:07Z)
- TOOLVERIFIER: Generalization to New Tools via Self-Verification [69.85190990517184]
We introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during tool selection.
Experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines.
arXiv Detail & Related papers (2024-02-21T22:41:38Z)
- MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [79.87054552116443]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z)
- A Comprehensive Study on Quality Assurance Tools for Java [15.255117038871337]
Quality assurance (QA) tools are receiving more and more attention and are widely used by developers.
Most existing research is limited in the following ways:
Studies compare tools without analyzing their scanning rules.
They disagree on the effectiveness of tools due to differences in study methodology and benchmark datasets.
There is no large-scale study on the analysis of time performance.
arXiv Detail & Related papers (2023-05-26T10:48:02Z)