Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
- URL: http://arxiv.org/abs/2505.14599v2
- Date: Sun, 08 Jun 2025 22:42:31 GMT
- Title: Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
- Authors: Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang
- Abstract summary: We introduce TruthHypo, a benchmark for assessing the capabilities of large language models in generating truthful hypotheses, and KnowHD, a knowledge-based hallucination detector that evaluates how well hypotheses are grounded in existing knowledge.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful scientific hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
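The abstract's core idea of filtering hypotheses by a groundedness score can be sketched as follows. This is a minimal illustration assuming a set-membership knowledge base and a simple support-ratio score; the function names and scoring rule are assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of groundedness-based hypothesis filtering in the
# spirit of KnowHD. The knowledge base, scoring rule, and threshold are
# illustrative assumptions.

def groundedness(claims, knowledge_base):
    """Fraction of a hypothesis's supporting claims found in the knowledge base."""
    if not claims:
        return 0.0
    supported = sum(1 for c in claims if c in knowledge_base)
    return supported / len(claims)

def filter_hypotheses(hypotheses, knowledge_base, threshold=0.8):
    """Keep only hypotheses whose supporting claims are sufficiently grounded."""
    return [h for h, claims in hypotheses
            if groundedness(claims, knowledge_base) >= threshold]

kb = {"gene A regulates pathway P", "drug D inhibits gene A"}
candidates = [
    ("D may modulate P", ["gene A regulates pathway P", "drug D inhibits gene A"]),
    ("D cures disease X", ["drug D targets disease X"]),  # unsupported claim
]
print(filter_hypotheses(candidates, kb))  # → ['D may modulate P']
```

In this toy setting, the first hypothesis is kept because both of its supporting claims appear in the knowledge base, while the second is filtered out because its single claim is unsupported.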
Related papers
- HypoChainer: A Collaborative System Combining LLMs and Knowledge Graphs for Hypothesis-Driven Scientific Discovery [4.020865072189471]
We propose HypoChainer, a visualization framework that integrates human expertise, knowledge graphs, and LLM reasoning. HypoChainer operates in three stages: first, exploration and contextualization, where experts use retrieval-augmented generation (RAG) and dimensionality reduction; second, hypothesis chain formation, where experts iteratively examine KG relationships around predictions and semantically linked entities; third, validation prioritization, where refined hypotheses are filtered based on KG-supported evidence to identify high-priority candidates for experimentation.
arXiv Detail & Related papers (2025-07-23T05:02:54Z) - MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search [93.64235254640967]
Large language models (LLMs) have shown promise in automating scientific hypothesis generation. We define the novel task of fine-grained scientific hypothesis discovery and propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis.
arXiv Detail & Related papers (2025-05-25T16:13:46Z) - ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery, and develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z) - Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models [20.648157071328807]
Large language models (LLMs) can identify novel research directions by analyzing existing knowledge.
LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect.
We propose KG-CoI, a system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs.
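The kind of knowledge-graph grounding described above can be sketched as retrieving relevant triples and prepending them to the generation prompt. This is a minimal illustration under assumed data and prompt formats; the triple store, entity matching, and template are not KG-CoI's actual design:

```python
# Minimal sketch of grounding an LLM hypothesis-generation prompt with
# knowledge-graph triples, in the spirit of KG-CoI. The triples and the
# prompt template are illustrative assumptions.

TRIPLES = [
    ("TP53", "regulates", "apoptosis"),
    ("MDM2", "inhibits", "TP53"),
    ("BRCA1", "repairs", "DNA damage"),
]

def retrieve_triples(entities, triples=TRIPLES):
    """Return triples whose head or tail entity matches a query entity."""
    return [t for t in triples if t[0] in entities or t[2] in entities]

def build_prompt(question, entities):
    """Prepend retrieved facts to the question so generation stays grounded."""
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in retrieve_triples(entities))
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nHypothesis:"

prompt = build_prompt("How might MDM2 affect cell death?", {"TP53", "MDM2"})
print(prompt)
```

Only triples touching the query entities (here, TP53 and MDM2) are injected, which is the basic mechanism by which external structured knowledge constrains what the model asserts.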
arXiv Detail & Related papers (2024-11-04T18:50:00Z) - MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses [72.39144388083712]
It remains unclear whether large language models (LLMs) can autonomously generate novel and valid hypotheses in chemistry. We develop a benchmark of 51 high-impact chemistry papers published online after January 2024, each manually annotated by PhD chemists with background, inspirations, and hypothesis. We assume that LLMs may already encode latent scientific knowledge associations not yet recognized by humans.
arXiv Detail & Related papers (2024-10-09T17:19:58Z) - Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation [15.495976478018264]
Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction.
We construct a dataset of background-hypothesis pairs from biomedical literature, partitioned into training, seen, and unseen test sets.
We assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2024-07-12T02:55:13Z) - LLM Internal States Reveal Hallucination Risk Faced With a Query [62.29558761326031]
Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries.
This paper investigates whether Large Language Models can estimate their own hallucination risk before response generation.
By a probing estimator, we leverage LLM self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
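A probing estimator of this kind can be sketched as a small classifier trained on internal activation vectors. The synthetic "hidden states" and the probe below are illustrative assumptions; the paper trains its estimator on real LLM internal states, not simulated ones:

```python
# Toy sketch of a linear probe over internal activations that predicts
# hallucination risk before generation. The synthetic activations are an
# assumption standing in for real LLM hidden states.
import numpy as np

rng = np.random.default_rng(0)

# Simulated hidden states: "risky" queries are shifted along one direction.
safe = rng.normal(0.0, 1.0, size=(200, 16))
risky = rng.normal(0.0, 1.0, size=(200, 16)) + np.array([2.0] + [0.0] * 15)
X = np.vstack([safe, risky])
y = np.array([0] * 200 + [1] * 200)

# Train a logistic-regression probe by gradient descent.
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted risk
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(f"probe accuracy: {np.mean(pred == y):.2f}")
```

The probe only reads activations; it never sees the generated text, which is what makes risk estimation possible before the response is produced.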
arXiv Detail & Related papers (2024-07-03T17:08:52Z) - LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z) - Large Language Models are Zero Shot Hypothesis Proposers [17.612235393984744]
Large Language Models (LLMs) hold a wealth of global and interdisciplinary knowledge that promises to break down information barriers.
We construct a dataset consisting of background knowledge and hypothesis pairs from biomedical literature.
We evaluate the hypothesis generation capabilities of various top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2023-11-10T10:03:49Z) - Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences [3.9985385067438344]
A strong hypothesis is a best guess based on existing evidence and informed by a comprehensive view of relevant literature.
With the exponential increase in the number of scientific articles published annually, manually aggregating and synthesizing evidence related to a given hypothesis is a challenge.
We share a novel dataset for the task of scientific hypothesis evidencing using community-driven annotations of studies in the social sciences.
arXiv Detail & Related papers (2023-09-07T04:15:17Z) - Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery.
Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity.
A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z) - SciMON: Scientific Inspiration Machines Optimized for Novelty [68.46036589035539]
We explore and enhance the ability of neural language models to generate novel scientific directions grounded in literature.
We take a dramatic departure with a novel setting in which models use as input background contexts.
We present SciMON, a modeling framework that uses retrieval of "inspirations" from past scientific papers.
arXiv Detail & Related papers (2023-05-23T17:12:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.