Related papers: LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context

LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context

URL: http://arxiv.org/abs/2412.17596v3
Date: Mon, 28 Apr 2025 06:12:14 GMT
Title: LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context
Authors: Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu, Hao Sun,
Abstract summary: We introduce LiveIdeaBench, a benchmark evaluating Large Language Models' scientific idea generation.<n>Our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity.<n>Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores.
Score: 13.967898012303325
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design (e.g., accurately extracting key findings from papers or generating coherent experimental procedures), existing evaluation benchmarks primarily assess performance using rich contextual inputs. We introduce LiveIdeaBench, a comprehensive benchmark evaluating LLMs' scientific idea generation by assessing divergent thinking capabilities using single-keyword prompts. Drawing from Guilford's creativity theory, our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Through extensive experimentation with over 40 leading models across 1,180 keywords spanning 22 scientific domains, we reveal that the scientific idea generation capabilities measured by our benchmark, are poorly predicted by standard metrics of general intelligence. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores. These findings highlight the need for specialized evaluation benchmarks for scientific idea generation and suggest that enhancing these idea generation capabilities in LLMs may require different training strategies than those used for improving general problem-solving abilities, potentially enabling a wider range of AI tools tailored for different stages of the scientific process.

Related papers

Reward Modeling for Scientific Writing Evaluation [50.33952894976367]
It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks.<n>We propose cost-efficient, open-source reward models tailored for scientific writing evaluation.
arXiv Detail & Related papers (2026-01-16T15:32:58Z)
HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow.<n>HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z)
Evaluating Large Language Models in Scientific Discovery [91.732562776782]
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge.<n>We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics.<n>The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
arXiv Detail & Related papers (2025-12-17T16:20:03Z)
MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation [24.759077885472678]
Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments.<n>Current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments.<n>We propose textbfMMReview, a comprehensive benchmark that spans multiple disciplines and modalities.
arXiv Detail & Related papers (2025-08-19T16:37:19Z)
GUIDE: Towards Scalable Advising for Research Ideas [9.819083407389524]
We develop a system to provide high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs.<n>Our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set.
arXiv Detail & Related papers (2025-07-09T17:59:21Z)
Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team [53.38438460574943]
IDVSCI is a multi-agent framework built on large language models (LLMs)<n>It incorporates two key innovations: a Dynamic Knowledge Exchange mechanism and a Dual-Diversity Review paradigm.<n>Results show that IDVSCI consistently achieves the best performance across two datasets.
arXiv Detail & Related papers (2025-06-23T07:12:08Z)
Revisiting LLM Evaluation through Mechanism Interpretability: a New Metric and Model Utility Law [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. We propose a novel metric, the Model Utilization Index (MUI), which introduces mechanism interpretability techniques to complement traditional performance metrics.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
IdeaBench: Benchmarking Large Language Models for Research Idea Generation [19.66218274796796]
Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems. We propose IdeaBench, a benchmark system that includes a comprehensive dataset and an evaluation framework. Our dataset comprises titles and abstracts from a diverse range of influential papers, along with their referenced works. Our evaluation framework is a two-stage process: first, using GPT-4o to rank ideas based on user-specified quality indicators such as novelty and feasibility, enabling scalable personalization.
arXiv Detail & Related papers (2024-10-31T17:04:59Z)
A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models [0.0]
We propose a comprehensive approach to benchmark development based on rigorous psychometric principles. We make the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education. We construct a novel benchmark guided by the Bloom's taxonomy and rigorously designed by a consortium of education experts trained in test development.
arXiv Detail & Related papers (2024-10-29T19:32:43Z)
Good Idea or Not, Representation of LLM Could Tell [86.36317971482755]
We focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas. We release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task. Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs.
arXiv Detail & Related papers (2024-09-07T02:07:22Z)
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [90.26363107905344]
Large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery. No evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas.
arXiv Detail & Related papers (2024-09-06T08:25:03Z)
Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment. To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
Can I understand what I create? Self-Knowledge Evaluation of Large Language Models [31.85129258347539]
Large language models (LLMs) have achieved remarkable progress in linguistic tasks. Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework.
arXiv Detail & Related papers (2024-06-10T09:53:54Z)
Creativity Has Left the Chat: The Price of Debiasing Language Models [1.223779595809275]
We investigate the unintended consequences of Reinforcement Learning from Human Feedback on the creativity of Large Language Models (LLMs) Our findings have significant implications for marketers who rely on LLMs for creative tasks such as copywriting, ad creation, and customer persona generation.
arXiv Detail & Related papers (2024-06-08T22:14:51Z)
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations. We introduce Scientific Generative Agent (SGA), a bilevel optimization framework. We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
Divergent Creativity in Humans and Large Language Models [37.67363469600804]
The recent surge in the capabilities of Large Language Models has led to claims that they are approaching a level of creativity akin to human capabilities. We leverage recent advances in creativity science to build a framework for in-depth analysis of divergent creativity in both state-of-the-art LLMs and a substantial dataset of 100,000 humans.
arXiv Detail & Related papers (2024-05-13T22:37:52Z)
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence [5.147767778946168]
We critically assess 23 state-of-the-art Large Language Models (LLMs) benchmarks. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, diversity, and the overlooking of cultural and ideological norms.
arXiv Detail & Related papers (2024-02-15T11:08:10Z)
Assessing and Understanding Creativity in Large Language Models [33.37237667182931]
This paper aims to establish an efficient framework for assessing the level of creativity in large language models (LLMs) By adapting the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks. We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration.
arXiv Detail & Related papers (2024-01-23T05:19:47Z)
Can AI Be as Creative as Humans? [84.43873277557852]
We prove in theory that AI can be as creative as humans under the condition that it can properly fit the data generated by human creators. The debate on AI's creativity is reduced into the question of its ability to fit a sufficient amount of data.
arXiv Detail & Related papers (2024-01-03T08:49:12Z)
Exploring the Cognitive Knowledge Structure of Large Language Models: An Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence. Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains. We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z)
Automatic Creativity Measurement in Scratch Programs Across Modalities [6.242018846706069]
We make the journey fromdefining a formal measure of creativity that is efficientlycomputable to applying the measure in a practical domain. We adapted the general measure for projects in the popular visual programming language Scratch. We designed a machine learning model for predicting the creativity of Scratch projects, trained and evaluated on human expert creativity assessments.
arXiv Detail & Related papers (2022-11-07T10:43:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.