Literature Meets Data: A Synergistic Approach to Hypothesis Generation
- URL: http://arxiv.org/abs/2410.17309v2
- Date: Tue, 19 Nov 2024 23:32:13 GMT
- Title: Literature Meets Data: A Synergistic Approach to Hypothesis Generation
- Authors: Haokun Liu, Yangqiaoyu Zhou, Mingxuan Li, Chenfei Yuan, Chenhao Tan,
- Abstract summary: We develop the first method that combines literature-based insights with data to perform hypothesis generation.
We also conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making.
- Score: 24.98928229927995
- License:
- Abstract: AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method on five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97\% over few-shot, 15.75\% over literature-based alone, and 3.37\% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI generated content detection. Our results show that human accuracy improves significantly by 7.44\% and 14.19\% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.
Related papers
- Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation [15.495976478018264]
Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction.
We construct a dataset of background-hypothesis pairs from biomedical literature, partitioned into training, seen, and unseen test sets.
We assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2024-07-12T02:55:13Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Autonomous LLM-driven research from data to human-verifiable research papers [0.0]
We build an automation platform that guides interacting through complete stepwise process.
In mode provided annotated data alone, datapaper raised hypotheses, designed plans, wrote and interpreted analysis codes, generated and interpreted results.
We demonstrate potential for AI-driven acceleration of scientific discovery while enhancing traceability, transparency and verifiability.
arXiv Detail & Related papers (2024-04-24T23:15:49Z) - Hypothesis Generation with Large Language Models [28.73562677221476]
We focus on hypothesis generation based on data (i.e., labeled examples)
Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process.
Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks.
arXiv Detail & Related papers (2024-04-05T18:00:07Z) - Towards Biologically Plausible and Private Gene Expression Data
Generation [47.72947816788821]
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications.
Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions.
We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
arXiv Detail & Related papers (2024-02-07T14:39:11Z) - Source-Free Unsupervised Domain Adaptation with Hypothesis Consolidation
of Prediction Rationale [53.152460508207184]
Source-Free Unsupervised Domain Adaptation (SFUDA) is a challenging task where a model needs to be adapted to a new domain without access to target domain labels or source domain data.
This paper proposes a novel approach that considers multiple prediction hypotheses for each sample and investigates the rationale behind each hypothesis.
To achieve the optimal performance, we propose a three-step adaptation process: model pre-adaptation, hypothesis consolidation, and semi-supervised learning.
arXiv Detail & Related papers (2024-02-02T05:53:22Z) - Large Language Models are Zero Shot Hypothesis Proposers [17.612235393984744]
Large Language Models (LLMs) hold a wealth of global and interdisciplinary knowledge that promises to break down information barriers.
We construct a dataset consist of background knowledge and hypothesis pairs from biomedical literature.
We evaluate the hypothesis generation capabilities of various top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2023-11-10T10:03:49Z) - Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery.
Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity.
A multi- module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs)
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive, two for bias evaluation, and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - Natural Language Inference with Self-Attention for Veracity Assessment
of Pandemic Claims [54.93898455714295]
We first describe the construction of the novel PANACEA dataset consisting of heterogeneous claims on COVID-19.
We then propose novel techniques for automated veracity assessment based on Natural Language Inference.
arXiv Detail & Related papers (2022-05-05T12:11:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.