The Refutability Gap: Challenges in Validating Reasoning by Large Language Models
- URL: http://arxiv.org/abs/2601.02380v1
- Date: Thu, 18 Dec 2025 14:42:03 GMT
- Title: The Refutability Gap: Challenges in Validating Reasoning by Large Language Models
- Authors: Elchanan Mossel,
- Abstract summary: Recent reports claim that Large Language Models (LLMs) have achieved the ability to derive new science and exhibit human-level general intelligence. We argue that such claims are not rigorous scientific claims, as they do not satisfy Popper's refutability principle.
- Score: 11.210425433215827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent reports claim that Large Language Models (LLMs) have achieved the ability to derive new science and exhibit human-level general intelligence. We argue that such claims are not rigorous scientific claims, as they do not satisfy Popper's refutability principle (often termed falsifiability), which requires that scientific statements be capable of being disproven. We identify several methodological pitfalls in current AI research on reasoning, including the inability to verify the novelty of findings due to opaque and non-searchable training data, the lack of reproducibility caused by continuous model updates, and the omission of human-interaction transcripts, which obscures the true source of scientific discovery. Additionally, the absence of counterfactuals and data on failed attempts creates a selection bias that may exaggerate LLM capabilities. To address these challenges, we propose guidelines for scientific transparency and reproducibility for research on reasoning by LLMs. Establishing such guidelines is crucial for both scientific integrity and the ongoing societal debates regarding fair data usage.
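The abstract proposes guidelines rather than a concrete format. Purely as a hedged sketch of what a disclosure record satisfying such guidelines might contain (all field names are hypothetical and not taken from the paper), the following Python snippet bundles the model identifier and version, the full interaction transcripts, the failed attempts, and the prior-art searches into a single verifiable artifact:
```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class ReasoningClaimRecord:
    """Hypothetical disclosure record for a claim that an LLM derived a new result."""
    model_name: str                    # exact model identifier, not just a family name
    model_version: str                 # snapshot/version, since models are updated continuously
    claimed_result: str                # the statement whose novelty is asserted
    prior_art_queries: List[str] = field(default_factory=list)  # searches run to check novelty
    transcripts: List[str] = field(default_factory=list)        # full human-model interaction logs
    failed_attempts: List[str] = field(default_factory=list)    # unsuccessful runs, against selection bias

    def fingerprint(self) -> str:
        """Stable hash of the full record so third parties can check it was not altered."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

record = ReasoningClaimRecord(
    model_name="example-llm",                       # placeholder, not a real model
    model_version="2025-12-18-snapshot",
    claimed_result="A new bound for problem X (illustrative only).",
    prior_art_queries=["site:arxiv.org bound for problem X"],
    transcripts=["<full prompt/response log would go here>"],
    failed_attempts=["<runs that did not produce the result>"],
)
print(record.fingerprint()[:16])
```
Publishing such a record alongside a claim would speak to the pitfalls listed in the abstract: the pinned version addresses continuous model updates, the transcripts expose the human contribution, and the logged failed attempts counter selection bias.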
Related papers
- Atomic Reasoning for Scientific Table Claim Verification [83.14588611859826]
Non-experts are susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning. Inspired by Cognitive Load Theory, we propose that enhancing a model's ability to interpret table-based claims involves reducing cognitive load.
arXiv Detail & Related papers (2025-06-08T02:46:22Z) - Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models [18.850296587858946]
We introduce TruthHypo, a benchmark for assessing the capabilities of large language models in generating truthful hypotheses. KnowHD is a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge.
arXiv Detail & Related papers (2025-05-20T16:49:40Z) - ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z) - Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models [20.648157071328807]
Large language models (LLMs) can identify novel research directions by analyzing existing knowledge.
LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect.
We propose KG-CoI, a system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs.
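KG-CoI's actual pipeline is not detailed in this summary; the toy sketch below only illustrates the general idea of grounding hypothesis generation in a knowledge graph (the triples, entity names, and function names are invented for illustration, not taken from the paper):
```python
# Toy knowledge graph of (subject, relation, object) triples; contents are invented.
KG = [
    ("gene_A", "upregulates", "protein_B"),
    ("protein_B", "inhibits", "pathway_C"),
    ("drug_D", "targets", "protein_B"),
]

def retrieve_supporting_triples(hypothesis: str, kg=KG):
    """Return triples whose subject or object is mentioned in the hypothesis text."""
    text = hypothesis.lower()
    return [t for t in kg if t[0].lower() in text or t[2].lower() in text]

def grounded_prompt(hypothesis: str) -> str:
    """Prepend retrieved triples as structured context for the LLM to cite or contradict."""
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in retrieve_supporting_triples(hypothesis))
    return f"Known facts:\n{facts}\n\nAssess the hypothesis against these facts: {hypothesis}"

print(grounded_prompt("drug_D modulates pathway_C via protein_B"))
```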
arXiv Detail & Related papers (2024-11-04T18:50:00Z) - Grounding Fallacies Misrepresenting Scientific Publications in Evidence [84.32990746227385]
We introduce MissciPlus, an extension of the fallacy detection dataset Missci. MissciPlus pairs the real-world misrepresented evidence with incorrect claims, identical to the input to evidence-based fact-checking models. Our findings show that current fact-checking models struggle to use misrepresented scientific passages to refute misinformation.
arXiv Detail & Related papers (2024-08-23T03:16:26Z) - Missci: Reconstructing Fallacies in Misrepresented Science [84.32990746227385]
Health-related misinformation on social networks can lead to poor decision-making and real-world dangers.
Missci is a novel argumentation theoretical model for fallacious reasoning.
We present Missci as a dataset to test the critical reasoning abilities of large language models.
arXiv Detail & Related papers (2024-06-05T12:11:10Z) - Empirical evaluation of Uncertainty Quantification in Retrieval-Augmented Language Models for Science [0.0]
This study investigates how uncertainty scores vary when scientific knowledge is incorporated as pretraining and retrieval data.
We observe that an existing RALM fine-tuned with scientific knowledge as the retrieval data tends to be more confident in generating predictions.
We also find that RALMs are overconfident in their predictions, producing inaccurate predictions with higher confidence than accurate ones.
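A minimal, self-contained illustration of that overconfidence pattern (the confidence values are invented; in a real RALM they would come from token probabilities or verbalized uncertainty) is to compare mean confidence on correct versus incorrect answers:
```python
from statistics import mean

# Invented example data: per-answer confidence and whether the answer was correct.
results = [
    (0.95, False), (0.90, True), (0.97, False),
    (0.88, True), (0.93, False), (0.60, True),
]

conf_correct = mean(c for c, ok in results if ok)      # mean confidence on accurate answers
conf_wrong = mean(c for c, ok in results if not ok)    # mean confidence on inaccurate answers

print(f"mean confidence when correct: {conf_correct:.2f}")
print(f"mean confidence when wrong:   {conf_wrong:.2f}")
# Overconfidence in the sense described above: wrong answers held with more confidence.
print("overconfident pattern:", conf_wrong > conf_correct)
```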
arXiv Detail & Related papers (2023-11-15T20:42:11Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - Evaluating the Effectiveness of Retrieval-Augmented Large Language Models in Scientific Document Reasoning [0.0]
Large Language Models (LLMs) often provide seemingly plausible but non-factual information, commonly referred to as hallucinations.
Retrieval-augmented LLMs offer a non-parametric approach to mitigating these issues by retrieving relevant information from external data sources.
We critically evaluate these models on their ability to perform scientific document reasoning tasks.
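The paper's own pipeline is not reproduced here; the sketch below is only a generic, minimal retrieval-augmented setup, with word-overlap retrieval standing in for a real retriever and a placeholder in place of the actual LLM call, to make the non-parametric idea concrete:
```python
from collections import Counter

# Tiny stand-in corpus; in a real RALM this would be a scientific document index.
CORPUS = [
    "Transformer language models can produce fluent but unsupported statements.",
    "Retrieval augmentation conditions generation on passages fetched from an external corpus.",
    "Uncertainty estimates help flag answers that lack retrieved support.",
]

def retrieve(query: str, corpus=CORPUS, k: int = 2):
    """Rank passages by simple word overlap with the query (stand-in for a dense retriever)."""
    q = Counter(query.lower().split())
    def score(doc):
        d = Counter(doc.lower().split())
        return sum(min(q[w], d[w]) for w in q)
    return sorted(corpus, key=score, reverse=True)[:k]

def grounded_query(query: str) -> str:
    """Build a grounded prompt; the LLM call itself is replaced by a placeholder."""
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."
    return prompt  # in practice this prompt would be sent to the LLM

print(grounded_query("How does retrieval augmentation reduce hallucinations?"))
```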
arXiv Detail & Related papers (2023-11-07T21:09:57Z) - SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables [68.76415918462418]
We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims.
Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models.
Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning.
arXiv Detail & Related papers (2023-05-22T16:13:50Z) - Generating Scientific Claims for Zero-Shot Scientific Fact Checking [54.62086027306609]
Automated scientific fact checking is difficult due to the complexity of scientific language and the scarcity of suitable training data.
We propose scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences.
We also demonstrate its usefulness in zero-shot fact checking for biomedical claims.
arXiv Detail & Related papers (2022-03-24T11:29:20Z)