GUIDE: Towards Scalable Advising for Research Ideas
- URL: http://arxiv.org/abs/2507.08870v1
- Date: Wed, 09 Jul 2025 17:59:21 GMT
- Title: GUIDE: Towards Scalable Advising for Research Ideas
- Authors: Yaowenqi Liu, BingXu Meng, Rui Pan, Jerry Huang, Tong Zhang
- Abstract summary: We develop a system to provide high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. Our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set.
- Score: 9.819083407389524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at https://github.com/HowardLiu0830/GUIDE-Research-Idea-Evaluation.
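The confidence-gated evaluation described in the abstract can be made concrete with a small sketch. This is not the released GUIDE code (see the repository linked above); the function name, data layout, and the 0.9 threshold are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): measure acceptance rate only over the
# submissions the system predicts "accept" with high confidence, as the abstract
# describes for the ICLR 2025 test set. Names and the threshold are assumptions.
from typing import List, Tuple

def acceptance_rate_at_confidence(
    predictions: List[Tuple[float, bool]],  # (predicted accept probability, actually accepted)
    threshold: float = 0.9,
) -> float:
    """Acceptance rate among submissions predicted 'accept' with probability >= threshold."""
    confident = [accepted for prob, accepted in predictions if prob >= threshold]
    return sum(confident) / len(confident) if confident else 0.0

# Example: three confident predictions, two of which were truly accepted.
print(acceptance_rate_at_confidence([(0.95, True), (0.92, True), (0.91, False), (0.60, True)]))
```

Raising the threshold trades coverage (how many submissions receive a confident verdict) for precision of the accept predictions.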
Related papers
- Bayes-Entropy Collaborative Driven Agents for Research Hypotheses Generation and Optimization [4.469102316542763]
This paper proposes a multi-agent collaborative framework called HypoAgents. It generates hypotheses through diversity sampling and establishes prior beliefs. It then employs retrieval-augmented generation (RAG) to gather external literature evidence. It identifies high-uncertainty hypotheses using information entropy $H = -\sum_i p_i \log p_i$ and actively refines them.
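As an illustration of the entropy-based selection step, the sketch below ranks hypotheses by the Shannon entropy of their belief distributions. It is a minimal stand-in, not the HypoAgents implementation, and all identifiers are hypothetical.

```python
# Minimal sketch: rank hypotheses by Shannon entropy H = -sum_i p_i log p_i of
# their belief distributions and flag the most uncertain ones for refinement.
import math
from typing import Dict, List

def entropy(belief: List[float]) -> float:
    """Shannon entropy of a discrete belief distribution (0 log 0 := 0)."""
    return -sum(p * math.log(p) for p in belief if p > 0)

def most_uncertain(beliefs: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the k hypothesis ids with the highest-entropy beliefs."""
    return sorted(beliefs, key=lambda h: entropy(beliefs[h]), reverse=True)[:k]

beliefs = {
    "h1": [0.9, 0.05, 0.05],   # confident belief, low entropy
    "h2": [0.34, 0.33, 0.33],  # near-uniform belief, high entropy
    "h3": [0.6, 0.3, 0.1],
}
print(most_uncertain(beliefs))  # ['h2', 'h3']
```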
arXiv Detail & Related papers (2025-08-03T13:05:32Z) - Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps. We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
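A best-of-n selection over first reasoning steps, as the summary describes, could look roughly like the following; the sampler and reward-model callables are placeholders, not the paper's actual components.

```python
# Hedged sketch: sample several candidate first reasoning steps, score them with
# a reward model, and keep the best one before continuing the chain of thought.
from typing import Callable, List

def pick_first_step(
    question: str,
    sample_step: Callable[[str], str],    # e.g. one LLM call returning a candidate first step
    reward: Callable[[str, str], float],  # reward model scoring (question, candidate step)
    n_samples: int = 8,
) -> str:
    """Best-of-n selection over candidate first reasoning steps."""
    candidates: List[str] = [sample_step(question) for _ in range(n_samples)]
    return max(candidates, key=lambda step: reward(question, step))
```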
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - AI Idea Bench 2025: AI Research Idea Generation Benchmark [10.983418515389667]
We present AI Idea Bench 2025, a framework designed to quantitatively evaluate and compare the ideas generated by Large Language Models (LLMs). The framework comprises a comprehensive dataset of 3,495 AI papers and their associated inspired works, along with a robust evaluation methodology. The evaluation system gauges idea quality in two dimensions: alignment with the ground-truth content of the original papers and judgment based on general reference material.
arXiv Detail & Related papers (2025-04-19T05:35:45Z) - LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context [13.967898012303325]
We introduce LiveIdeaBench, a benchmark evaluating Large Language Models' scientific idea generation. Our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores.
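A rough sketch of panel-style scoring over the five listed dimensions is shown below; the judge interface and the mean aggregation are assumptions for illustration, not the benchmark's exact protocol.

```python
# Illustrative sketch: average each dimension's score across a panel of judges.
from statistics import mean
from typing import Callable, Dict, List

DIMENSIONS = ["originality", "feasibility", "fluency", "flexibility", "clarity"]

def score_idea(
    idea: str,
    judges: List[Callable[[str, str], float]],  # each judge maps (idea, dimension) -> score
) -> Dict[str, float]:
    """Average each dimension's score over the judge panel."""
    return {dim: mean(judge(idea, dim) for judge in judges) for dim in DIMENSIONS}

# Toy usage with two stub judges standing in for LLM calls.
judges = [lambda idea, dim: 7.0, lambda idea, dim: 8.0]
print(score_idea("Use entropy-guided agents to refine hypotheses", judges))
```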
arXiv Detail & Related papers (2024-12-23T14:13:44Z) - Learning to Generate Research Idea with Dynamic Control [21.30777644522451]
Large language models (LLMs) have shown promise in generating hypotheses and research ideas. We introduce a novel framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL). Our framework provides a balanced approach to research ideation, achieving high-quality outcomes by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.
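One way to picture the "dynamic control" of the RL stage is a scalar reward whose per-criterion weights can be shifted during training; the sketch below is illustrative only, and the weights and scorers are not taken from the paper.

```python
# Hedged sketch of the RL stage only: a scalar reward mixing novelty,
# feasibility, and effectiveness scores with controllable weights.
from typing import Dict

def idea_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Normalized weighted combination of per-criterion scores in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total

# Example: shift emphasis from novelty toward feasibility mid-training.
scores = {"novelty": 0.9, "feasibility": 0.4, "effectiveness": 0.6}
print(idea_reward(scores, {"novelty": 0.6, "feasibility": 0.2, "effectiveness": 0.2}))
print(idea_reward(scores, {"novelty": 0.2, "feasibility": 0.6, "effectiveness": 0.2}))
```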
arXiv Detail & Related papers (2024-12-19T08:28:18Z) - Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
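For readers unfamiliar with the search component, the following is a generic A* sketch over a toy graph; ExploreToM's search operates over a custom domain-specific language of story actions, which is not reproduced here.

```python
# Generic A* search sketch (illustration only, not ExploreToM's DSL-based search).
import heapq
from itertools import count
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

def a_star(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    neighbors: Callable[[Hashable], Iterable[Tuple[Hashable, float]]],
    heuristic: Callable[[Hashable], float],
) -> Optional[List[Hashable]]:
    """Return a lowest-cost path from start to a goal state, or None if unreachable."""
    tie = count()  # tiebreaker so heapq never compares states directly
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    best_cost: Dict[Hashable, float] = {start: 0.0}
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt, step_cost in neighbors(state):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + heuristic(nxt)
                heapq.heappush(frontier, (priority, next(tie), new_cost, nxt, path + [nxt]))
    return None

# Toy graph: with a zero heuristic this reduces to Dijkstra's algorithm.
graph = {"s": [("a", 1.0), ("b", 4.0)], "a": [("g", 1.0)], "b": [("g", 1.0)], "g": []}
print(a_star("s", lambda x: x == "g", lambda x: graph[x], lambda x: 0.0))  # ['s', 'a', 'g']
```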
arXiv Detail & Related papers (2024-12-12T21:29:00Z) - Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [92.89673285398521]
o1-like reasoning systems have demonstrated remarkable capabilities in solving complex reasoning tasks. We introduce an "imitate, explore, and self-improve" framework to train the reasoning model. Our approach achieves competitive performance compared to industry-level reasoning systems.
arXiv Detail & Related papers (2024-12-12T16:20:36Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of the Q-network during training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
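A heavily hedged sketch of an eigenvalue-based divergence monitor is given below: it tracks the largest eigenvalue of a Gram matrix of per-sample Q-network gradients via power iteration. This is only a proxy in the spirit of the summary, not the paper's SEEM definition.

```python
# Hedged sketch (not the paper's SEEM metric): the largest eigenvalue of an
# NTK-style Gram matrix of per-sample Q-network gradients, computed by power
# iteration, as a rough early-warning signal for self-excitation.
import numpy as np

def largest_eigenvalue(gram: np.ndarray, iters: int = 100) -> float:
    """Power iteration on a symmetric positive semi-definite matrix."""
    v = np.ones(gram.shape[0]) / np.sqrt(gram.shape[0])
    for _ in range(iters):
        w = gram @ v
        v = w / (np.linalg.norm(w) + 1e-12)
    return float(v @ gram @ v)

def divergence_signal(per_sample_grads: np.ndarray) -> float:
    """per_sample_grads: (batch, n_params) flattened gradients of Q(s, a)."""
    gram = per_sample_grads @ per_sample_grads.T  # (batch, batch) kernel matrix
    return largest_eigenvalue(gram)

# Example: a signal that keeps growing across training checkpoints hints at divergence.
grads_at_checkpoint = np.random.randn(32, 256)
print(divergence_signal(grads_at_checkpoint))
```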
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Neural Message Passing for Objective-Based Uncertainty Quantification and Optimal Experimental Design [15.692012868181635]
We propose a novel data-driven scheme to reduce the computational cost of objective-based uncertainty quantification (objective-UQ) via the mean objective cost of uncertainty (MOCU).
We show that our proposed approach can accelerate MOCU-based OED by four to five orders of magnitude, without any visible performance loss.
arXiv Detail & Related papers (2022-03-14T14:08:46Z)