LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context
- URL: http://arxiv.org/abs/2412.17596v2
- Date: Sun, 12 Jan 2025 03:52:16 GMT
- Title: LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context
- Authors: Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu, Hao Sun
- Abstract summary: We introduce LiveIdeaBench, a benchmark that evaluates Large Language Models' scientific creativity and divergent thinking capabilities.
Our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across four key dimensions: originality, feasibility, fluency, and flexibility.
- Score: 13.967898012303325
- Abstract: While Large Language Models (LLMs) have demonstrated remarkable capabilities in scientific tasks, existing evaluation frameworks primarily assess their performance using rich contextual inputs, overlooking their ability to generate novel ideas from minimal information. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs' scientific creativity and divergent thinking capabilities using single-keyword prompts. Drawing from Guilford's creativity theory, our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across four key dimensions: originality, feasibility, fluency, and flexibility. Through extensive experimentation with 20 leading models across 1,180 keywords spanning 18 scientific domains, we reveal that scientific creative ability shows distinct patterns from general intelligence metrics. Notably, our results demonstrate that models like QwQ-32B-preview achieve comparable creative performance to top-tier models like o1-preview, despite significant gaps in their general intelligence scores. These findings highlight the importance of specialized evaluation frameworks for scientific creativity and suggest that the development of creative capabilities in LLMs may follow different trajectories than traditional problem-solving abilities.
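The abstract describes the evaluation pipeline only at a high level: a single-keyword prompt elicits an idea from the model under test, and a panel of judge LLMs scores each idea on originality, feasibility, fluency, and flexibility. The sketch below illustrates that flow in Python; the function names, prompt wording, 0-10 scale, and the simple mean-over-judges aggregation are assumptions for illustration, not the paper's released implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical sketch of a LiveIdeaBench-style loop (not the authors' code).
DIMENSIONS = ["originality", "feasibility", "fluency", "flexibility"]

# A "model" here is any callable mapping a prompt string to a completion string.
GenerateFn = Callable[[str], str]

@dataclass
class IdeaScore:
    keyword: str
    idea: str
    scores: Dict[str, float]  # dimension -> mean score over the judge panel

def generate_idea(model: GenerateFn, keyword: str) -> str:
    """Elicit a research idea from a single keyword (minimal-context prompt)."""
    prompt = f"Propose a novel scientific research idea inspired by the keyword: '{keyword}'."
    return model(prompt)

def judge_idea(judges: List[GenerateFn], keyword: str, idea: str) -> Dict[str, float]:
    """Ask each judge LLM to rate the idea on every dimension, then average."""
    per_dim: Dict[str, List[float]] = {d: [] for d in DIMENSIONS}
    for judge in judges:
        for dim in DIMENSIONS:
            prompt = (
                f"Keyword: {keyword}\nIdea: {idea}\n"
                f"Rate the {dim} of this idea on a 0-10 scale. Reply with a number only."
            )
            try:
                per_dim[dim].append(float(judge(prompt).strip()))
            except ValueError:
                continue  # skip malformed judge replies
    return {d: mean(v) if v else float("nan") for d, v in per_dim.items()}

def evaluate(model: GenerateFn, judges: List[GenerateFn], keywords: List[str]) -> List[IdeaScore]:
    """Benchmark loop: one idea per keyword, scored by the full judge panel."""
    return [
        IdeaScore(kw, idea, judge_idea(judges, kw, idea))
        for kw in keywords
        for idea in [generate_idea(model, kw)]
    ]
```

In practice the model and judge callables would wrap API clients for the 20 evaluated models and the dynamic judge panel, and the keyword list would cover the 1,180 keywords across 18 scientific domains mentioned in the abstract.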
Related papers
- A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
The Oogiri game is a creativity-driven task requiring humor and associative thinking.
LoTbench is an interactive, causality-aware evaluation framework.
Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z)
- Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data.
Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios.
Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z)
- Good Idea or Not, Representation of LLM Could Tell [86.36317971482755]
We focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas.
We release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task.
Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs.
arXiv Detail & Related papers (2024-09-07T02:07:22Z)
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [90.26363107905344]
Large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery.
However, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas.
arXiv Detail & Related papers (2024-09-06T08:25:03Z)
- Benchmarking Language Model Creativity: A Case Study on Code Generation [39.546827184857754]
In this work, we introduce a framework for quantifying LLM creativity.
We define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the generated creative responses.
We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions.
arXiv Detail & Related papers (2024-07-12T05:55:22Z)
- LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
- Divergent Creativity in Humans and Large Language Models [37.67363469600804]
The recent surge in the capabilities of Large Language Models has led to claims that they are approaching a level of creativity akin to human capabilities.
We leverage recent advances in creativity science to build a framework for in-depth analysis of divergent creativity in both state-of-the-art LLMs and a substantial dataset of 100,000 humans.
arXiv Detail & Related papers (2024-05-13T22:37:52Z)
- Assessing and Understanding Creativity in Large Language Models [33.37237667182931]
This paper aims to establish an efficient framework for assessing the level of creativity in large language models (LLMs).
By adapting the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks.
We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration.
arXiv Detail & Related papers (2024-01-23T05:19:47Z)
- Exploring the Cognitive Knowledge Structure of Large Language Models: An Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence.
Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains.
We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom's taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z)
- Automatic Creativity Measurement in Scratch Programs Across Modalities [6.242018846706069]
We make the journey from defining a formal measure of creativity that is efficiently computable to applying the measure in a practical domain.
We adapted the general measure for projects in the popular visual programming language Scratch.
We designed a machine learning model for predicting the creativity of Scratch projects, trained and evaluated on human expert creativity assessments.
arXiv Detail & Related papers (2022-11-07T10:43:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.