Related papers: IdeaBench: Benchmarking Large Language Models for Research Idea Generation

IdeaBench: Benchmarking Large Language Models for Research Idea Generation

URL: http://arxiv.org/abs/2411.02429v1
Date: Thu, 31 Oct 2024 17:04:59 GMT
Title: IdeaBench: Benchmarking Large Language Models for Research Idea Generation
Authors: Sikun Guo, Amir Hassan Shariatmadari, Guangzhi Xiong, Albert Huang, Eric Xie, Stefan Bekiranov, Aidong Zhang,
Abstract summary: Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems. We propose IdeaBench, a benchmark system that includes a comprehensive dataset and an evaluation framework. Our dataset comprises titles and abstracts from a diverse range of influential papers, along with their referenced works. Our evaluation framework is a two-stage process: first, using GPT-4o to rank ideas based on user-specified quality indicators such as novelty and feasibility, enabling scalable personalization.
Score: 19.66218274796796
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems, achieving state-of-the-art results in various tasks, including scientific discovery and hypothesis generation. However, the lack of a comprehensive and systematic evaluation framework for generating research ideas using LLMs poses a significant obstacle to understanding and assessing their generative capabilities in scientific discovery. To address this gap, we propose IdeaBench, a benchmark system that includes a comprehensive dataset and an evaluation framework for standardizing the assessment of research idea generation using LLMs. Our dataset comprises titles and abstracts from a diverse range of influential papers, along with their referenced works. To emulate the human process of generating research ideas, we profile LLMs as domain-specific researchers and ground them in the same context considered by human researchers. This maximizes the utilization of the LLMs' parametric knowledge to dynamically generate new research ideas. We also introduce an evaluation framework for assessing the quality of generated research ideas. Our evaluation framework is a two-stage process: first, using GPT-4o to rank ideas based on user-specified quality indicators such as novelty and feasibility, enabling scalable personalization; and second, calculating relative ranking based "Insight Score" to quantify the chosen quality indicator. The proposed benchmark system will be a valuable asset for the community to measure and compare different LLMs, ultimately advancing the automation of the scientific discovery process.

Related papers

Evaluating Large Language Models in Scientific Discovery [91.732562776782]
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge.<n>We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics.<n>The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance.
arXiv Detail & Related papers (2025-12-17T16:20:03Z)
Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation [1.2232326171442904]
This paper proposes a scientific idea generation system called GYWI.<n>It combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base.<n>The generated ideas are evaluated from the following five dimensions: novelty, feasibility, clarity, relevance, and significance.
arXiv Detail & Related papers (2025-12-05T03:38:23Z)
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research [33.79419161415481]
AbGen is the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research.<n>To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems.
arXiv Detail & Related papers (2025-07-17T17:09:22Z)
Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science [25.857554476782827]
This paper explores how augmenting large language models with relevant data during the idea generation process can enhance the quality of generated ideas.<n>We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%.<n>A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality.
arXiv Detail & Related papers (2025-05-27T16:23:42Z)
AI Idea Bench 2025: AI Research Idea Generation Benchmark [10.983418515389667]
We present AI Idea Bench 2025, a framework designed to quantitatively evaluate and compare the ideas generated by Language Models (LLMs) The framework comprises a comprehensive dataset of 3,495 AI papers and their associated inspired works, along with a robust evaluation methodology. The evaluation system gauges idea quality in two dimensions: alignment with the ground-truth content of the original papers and judgment based on general reference material.
arXiv Detail & Related papers (2025-04-19T05:35:45Z)
FutureGen: LLM-RAG Approach to Generate the Future Work of Scientific Article [6.682911432177815]
This study generates future work suggestions from key sections of a scientific article alongside related papers. We experimented with various Large Language Models (LLMs) and integrated Retrieval-Augmented Generation (RAG) to enhance the generation process.
arXiv Detail & Related papers (2025-03-20T06:14:02Z)
MLGym: A New Framework and Benchmark for Advancing AI Research Agents [51.9387884953294]
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing large language models on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro.
arXiv Detail & Related papers (2025-02-20T12:28:23Z)
LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context [13.967898012303325]
We introduce LiveIdeaBench, a benchmark evaluating Large Language Models' scientific idea generation. Our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores.
arXiv Detail & Related papers (2024-12-23T14:13:44Z)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
Large Language Model for Qualitative Research -- A Systematic Mapping Study [3.302912592091359]
Large Language Models (LLMs), powered by advanced generative AI, have emerged as transformative tools. This study systematically maps the literature on the use of LLMs for qualitative research. Findings reveal that LLMs are utilized across diverse fields, demonstrating the potential to automate processes.
arXiv Detail & Related papers (2024-11-18T21:28:00Z)
Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents [64.64280477958283]
An exponential increase in scientific literature makes it challenging for researchers to stay current with recent advances and identify meaningful research directions. Recent developments in large language models(LLMs) suggest a promising avenue for automating the generation of novel research ideas. We propose a Chain-of-Ideas(CoI) agent, an LLM-based agent that organizes relevant literature in a chain structure to effectively mirror the progressive development in a research domain.
arXiv Detail & Related papers (2024-10-17T03:26:37Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [90.26363107905344]
Large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery. No evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas.
arXiv Detail & Related papers (2024-09-06T08:25:03Z)
Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement. We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z)
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work. ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z)
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence [5.147767778946168]
We critically assess 23 state-of-the-art Large Language Models (LLMs) benchmarks. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, diversity, and the overlooking of cultural and ideological norms.
arXiv Detail & Related papers (2024-02-15T11:08:10Z)
A Survey on Large Language Models for Recommendation [77.91673633328148]
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP) This survey presents a taxonomy that categorizes these models into two major paradigms, respectively Discriminative LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec)
arXiv Detail & Related papers (2023-05-31T13:51:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.