Related papers: HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

URL: http://arxiv.org/abs/2510.15614v1
Date: Fri, 17 Oct 2025 13:00:32 GMT
Title: HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Authors: Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Yew-Soon Ong, Anirudh Goyal, Dianbo Liu
Abstract summary: We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets. We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows.
Score: 46.896452542901805
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
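
The three indicators named in the abstract can be illustrated with a small sketch. The code below is not taken from the HypoSpace repository. The exact formulas are an assumption based only on the abstract's wording: Validity as the fraction of proposals that lie in the admissible set, Uniqueness as the fraction of proposals that are distinct, and Recovery as the fraction of the admissible set that is covered. The toy domain (two-input Boolean functions written as truth tables) and the names INPUTS, is_consistent, enumerate_admissible, and score are hypothetical and chosen for illustration.

```python
from itertools import product

# All inputs of a two-input Boolean function, in a fixed order:
# [(0, 0), (0, 1), (1, 0), (1, 1)]
INPUTS = list(product((0, 1), repeat=2))


def is_consistent(table, observations):
    """Deterministic validator: does the truth table match every observation?"""
    return all(table[INPUTS.index(x)] == y for x, y in observations.items())


def enumerate_admissible(observations):
    """Exactly enumerate every truth table consistent with the observations."""
    return {
        table
        for table in product((0, 1), repeat=len(INPUTS))
        if is_consistent(table, observations)
    }


def score(proposals, admissible):
    """Assumed definitions of the three indicators (see the lead-in)."""
    n = len(proposals)
    if n == 0 or not admissible:
        return {"validity": 0.0, "uniqueness": 0.0, "recovery": 0.0}
    valid = [p for p in proposals if p in admissible]
    return {
        "validity": len(valid) / n,                        # precision of proposals
        "uniqueness": len(set(proposals)) / n,             # non-redundancy
        "recovery": len(set(valid)) / len(admissible),     # coverage of admissible set
    }


# Underdetermined observations: only two of the four inputs are observed,
# so four truth tables remain admissible.
observations = {(0, 0): 0, (1, 1): 1}
admissible = enumerate_admissible(observations)

# Hypothetical model output: AND, OR, AND again (a duplicate), and NOR (invalid).
proposals = [(0, 0, 0, 1), (0, 1, 1, 1), (0, 0, 0, 1), (1, 0, 0, 0)]
metrics = score(proposals, admissible)
print(metrics)
```

In this toy run three of the four proposals are valid, yet only two of the four admissible functions are recovered, which is the kind of gap a correctness-only metric would not show.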

Related papers

Hypothesis Testing over Observable Regimes in Singular Models [0.12183405753834557]
We show that the fundamental obstruction to testing in singular statistical models is not singularity itself, but the formulation of hypotheses on non-identifiable parameter quantities. We formalize this overlap obstruction and show that hypotheses depending on non-identifiable parameter functions necessarily fail in this sense. In contrast, hypotheses formulated over identifiable observables, quantities that are determined by the induced distribution, reduce entirely to classical testing theory.
arXiv Detail & Related papers (2026-02-27T16:44:29Z)
Differentially private testing for relevant dependencies in high dimensions [1.809722301908016]
We investigate the problem of detecting dependencies between the components of a high-dimensional vector. Instead of testing whether the coordinates are pairwise independent, we are interested in determining whether certain pairwise associations do not exceed a given threshold in absolute value. We propose a novel bootstrap-based methodology that is especially powerful in sparse settings.
arXiv Detail & Related papers (2025-11-21T11:38:40Z)
Towards Inference-time Scaling for Continuous Space Reasoning [55.40260529506702]
Inference-time scaling has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling.
arXiv Detail & Related papers (2025-10-14T05:53:41Z)
Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs [54.596180382762036]
Abductive reasoning in knowledge graphs aims to generate plausible logical hypotheses from observed entities. Due to a lack of controllability, a single observation may yield numerous plausible but redundant or irrelevant hypotheses. We introduce the task of controllable hypothesis generation to improve the practical utility of abductive reasoning.
arXiv Detail & Related papers (2025-05-27T09:36:47Z)
MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search [102.11776494401705]
Large language models (LLMs) have shown promise in automating scientific hypothesis generation. Existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the new task of fine-grained scientific hypothesis discovery.
arXiv Detail & Related papers (2025-05-25T16:13:46Z)
Automated Hypothesis Validation with Agentic Sequential Falsifications [45.572893831500686]
Many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses.
arXiv Detail & Related papers (2025-02-14T01:46:00Z)
Simultaneous inference for generalized linear models with unmeasured confounders [0.0]
We propose a unified statistical estimation and inference framework that harnesses structures and integrates linear projections into three key stages. We show effective Type-I error control of $z$-tests as sample and response sizes approach infinity.
arXiv Detail & Related papers (2023-09-13T18:53:11Z)
Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses that are new even to humanity. A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z)
Diverse, Global and Amortised Counterfactual Explanations for Uncertainty Estimates [31.241489953967694]
We study the diversity of such sets and find that many CLUEs are redundant. We then propose GLobal AMortised CLUE (GLAM-CLUE), a distinct and novel method which learns amortised mappings on specific groups of uncertain inputs. Our experiments show that $\delta$-CLUE, $\nabla$-CLUE, and GLAM-CLUE all address shortcomings of CLUE and provide beneficial explanations of uncertainty estimates to practitioners.
arXiv Detail & Related papers (2021-12-05T18:27:21Z)
Asymptotic relative submajorization of multiple-state boxes [0.0]
Pairs of states are the basic objects in the resource theory of asymmetric distinguishability (Wang and Wilde, 2019), where free operations are arbitrary quantum channels that are applied to both states. We consider boxes of a fixed finite number of states and study an extension of the relative submajorization preorder to such objects. This preorder characterizes error probabilities in the case of testing a composite null hypothesis against a simple alternative hypothesis, as well as certain error probabilities in state discrimination.
arXiv Detail & Related papers (2020-07-22T08:29:52Z)
Empirically Verifying Hypotheses Using Reinforcement Learning [58.09414653169534]
This paper formulates hypothesis verification as an RL problem. We aim to build an agent that, given a hypothesis about the dynamics of the world, can take actions to generate observations which can help predict whether the hypothesis is true or false.
arXiv Detail & Related papers (2020-06-29T01:01:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.