Related papers: HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation

HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation

URL: http://arxiv.org/abs/2510.00620v1
Date: Wed, 01 Oct 2025 07:52:19 GMT
Title: HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation
Authors: Rosni Vasu, Peter Jansen, Pao Siangliulue, Cristina Sarasua, Abraham Bernstein, Peter Clark, Bhavana Dalvi Mishra,
Abstract summary: HARPA is a tool to generate hypotheses that are both testable and grounded in the scientific literature.<n>Our evaluations show that HARPA-generated hypothesis-driven research proposals perform comparably to a strong baseline AI-researcher.<n>When tested with the ASD agent (CodeScientist), HARPA produced more successful executions (20 vs. 11 out of 40) and fewer failures (16 vs. 21 out of 40)
Score: 29.9491787481972
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While there has been a surge of interest in automated scientific discovery (ASD), especially with the emergence of LLMs, it remains challenging for tools to generate hypotheses that are both testable and grounded in the scientific literature. Additionally, existing ideation tools are not adaptive to prior experimental outcomes. We developed HARPA to address these challenges by incorporating the ideation workflow inspired by human researchers. HARPA first identifies emerging research trends through literature mining, then explores hypothesis design spaces, and finally converges on precise, testable hypotheses by pinpointing research gaps and justifying design choices. Our evaluations show that HARPA-generated hypothesis-driven research proposals perform comparably to a strong baseline AI-researcher across most qualitative dimensions (e.g., specificity, novelty, overall quality), but achieve significant gains in feasibility(+0.78, p$<0.05$, bootstrap) and groundedness (+0.85, p$<0.01$, bootstrap) on a 10-point Likert scale. When tested with the ASD agent (CodeScientist), HARPA produced more successful executions (20 vs. 11 out of 40) and fewer failures (16 vs. 21 out of 40), showing that expert feasibility judgments track with actual execution success. Furthermore, to simulate how researchers continuously refine their understanding of what hypotheses are both testable and potentially interesting from experience, HARPA learns a reward model that scores new hypotheses based on prior experimental outcomes, achieving approx. a 28\% absolute gain over HARPA's untrained baseline scorer. Together, these methods represent a step forward in the field of AI-driven scientific discovery.

Related papers

Accelerating Social Science Research via Agentic Hypothesization and Experimentation [33.55093074029515]
EXPERIGEN is a framework that operationalizes end-to-end discovery through a Bayesian optimization inspired two-phase search.<n>It consistently discovers 2-4x more statistically significant hypotheses that are 7-17 percent more predictive than prior approaches.<n>We conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results with p less than 1e-6 and a large effect size of 344 percent.
arXiv Detail & Related papers (2026-02-08T14:20:56Z)
FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights [63.32178443510396]
We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings.<n>Even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
arXiv Detail & Related papers (2026-02-02T23:21:13Z)
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.3527268311731]
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM)<n>We operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.<n>Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
arXiv Detail & Related papers (2025-12-18T12:44:36Z)
Bayes-Entropy Collaborative Driven Agents for Research Hypotheses Generation and Optimization [4.469102316542763]
This paper proposes a multi-agent collaborative framework called HypoAgents.<n>It generates hypotheses through diversity sampling and establishes prior beliefs.<n>It then employs etrieval-augmented generation (RAG) to gather external literature evidence.<n>It identifies high-uncertainty hypotheses using information entropy $H = - sum p_ilog p_i$ and actively refines them.
arXiv Detail & Related papers (2025-08-03T13:05:32Z)
Open-ended Scientific Discovery via Bayesian Surprise [63.26412847240136]
AutoDS is a method for open-ended scientific discovery that instead drives scientific exploration using Bayesian surprise.<n>We evaluate AutoDS in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science.
arXiv Detail & Related papers (2025-06-30T22:53:59Z)
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [128.2992631982687]
We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones.<n>We propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis.<n>We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator.
arXiv Detail & Related papers (2025-05-23T13:24:50Z)
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined.<n>We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery.<n>We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi- module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.