Related papers: Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences

Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences

URL: http://arxiv.org/abs/2309.06578v3
Date: Tue, 26 Mar 2024 03:33:45 GMT
Title: Can Large Language Models Discern Evidence for Scientific Hypotheses? Case Studies in the Social Sciences
Authors: Sai Koneru, Jian Wu, Sarah Rajtmajer,
Abstract summary: A strong hypothesis is a best guess based on existing evidence and informed by a comprehensive view of relevant literature. With exponential increase in the number of scientific articles published annually, manual aggregation and synthesis of evidence related to a given hypothesis is a challenge. We share a novel dataset for the task of scientific hypothesis evidencing using community-driven annotations of studies in the social sciences.
Score: 3.9985385067438344
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hypothesis formulation and testing are central to empirical research. A strong hypothesis is a best guess based on existing evidence and informed by a comprehensive view of relevant literature. However, with exponential increase in the number of scientific articles published annually, manual aggregation and synthesis of evidence related to a given hypothesis is a challenge. Our work explores the ability of current large language models (LLMs) to discern evidence in support or refute of specific hypotheses based on the text of scientific abstracts. We share a novel dataset for the task of scientific hypothesis evidencing using community-driven annotations of studies in the social sciences. We compare the performance of LLMs to several state-of-the-art benchmarks and highlight opportunities for future research in this area. The dataset is available at https://github.com/Sai90000/ScientificHypothesisEvidencing.git

Related papers

Evaluating Large Language Models in Scientific Discovery [91.732562776782]
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge.<n>We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics.<n>The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
arXiv Detail & Related papers (2025-12-17T16:20:03Z)
MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search [93.64235254640967]
Large language models (LLMs) have shown promise in automating scientific hypothesis generation.<n>We define the novel task of fine-grained scientific hypothesis discovery.<n>We propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis.
arXiv Detail & Related papers (2025-05-25T16:13:46Z)
Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models [18.850296587858946]
We introduce TruthHypo, a benchmark for assessing the capabilities of large language models in generating truthful hypotheses.<n>KnowHD is a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge.
arXiv Detail & Related papers (2025-05-20T16:49:40Z)
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation [24.656083479331645]
We introduce HypoBench, a novel benchmark designed to evaluate hypothesis generation methods across multiple aspects. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Results indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns.
arXiv Detail & Related papers (2025-04-15T18:00:00Z)
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery [24.630117520005257]
We introduce BoxingGym, a benchmark with 10 environments for evaluating experimental design and model discovery. We compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery.
arXiv Detail & Related papers (2025-01-02T21:15:57Z)
Hypothesizing Missing Causal Variables with LLMs [55.28678224020973]
We formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph. We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect. We also observe surprising results where some of the open-source models outperform the closed GPT-4 model.
arXiv Detail & Related papers (2024-09-04T10:37:44Z)
A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled. We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations. We introduce Scientific Generative Agent (SGA), a bilevel optimization framework. We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z)
Large Language Models are Zero Shot Hypothesis Proposers [17.612235393984744]
Large Language Models (LLMs) hold a wealth of global and interdisciplinary knowledge that promises to break down information barriers. We construct a dataset consist of background knowledge and hypothesis pairs from biomedical literature. We evaluate the hypothesis generation capabilities of various top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2023-11-10T10:03:49Z)
Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi- module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z)
Modeling Information Change in Science Communication with Semantically Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change. SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers. Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
arXiv Detail & Related papers (2022-10-24T07:44:38Z)
Exploring Lexical Irregularities in Hypothesis-Only Models of Natural Language Inference [5.283529004179579]
Natural Language Inference (NLI) or Recognizing Textual Entailment (RTE) is the task of predicting the entailment relation between a pair of sentences. Models that understand entailment should encode both, the premise and the hypothesis. Experiments by Poliak et al. revealed a strong preference of these models towards patterns observed only in the hypothesis.
arXiv Detail & Related papers (2021-01-19T01:08:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.