Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search
- URL: http://arxiv.org/abs/2504.05500v2
- Date: Thu, 10 Apr 2025 01:06:05 GMT
- Title: Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search
- Authors: Vahid Majdinasab, Amin Nikanjam, Foutse Khomh
- Abstract summary: Static benchmarks fail to capture the depth and breadth of Large Language Models (LLMs) capabilities. We introduce Prism, a flexible, dynamic benchmarking framework designed for comprehensive LLM assessment. Prism builds on three key components: (1) a tree-based state representation that models evaluation as a Markov Decision Process, (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios, and (3) a multi-agent evaluation pipeline that enables simultaneous assessment of diverse capabilities.
- Score: 13.135962181354465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has outpaced traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. We introduce Prism, a flexible, dynamic benchmarking framework designed for comprehensive LLM assessment. Prism builds on three key components: (1) a tree-based state representation that models evaluation as a Markov Decision Process, (2) a Monte Carlo Tree Search algorithm adapted to uncover challenging evaluation scenarios, and (3) a multi-agent evaluation pipeline that enables simultaneous assessment of diverse capabilities. To ensure robust evaluation, Prism integrates structural measurements of tree exploration patterns with performance metrics across difficulty levels, providing detailed diagnostics of error patterns, test coverage, and solution approaches. Through extensive experiments on five state-of-the-art LLMs, we analyze how model architecture and scale influence code generation performance across varying task difficulties. Our results demonstrate Prism's effectiveness as a dynamic benchmark that evolves with model advancements while offering deeper insights into their limitations.
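To make the MCTS-driven exploration concrete, the following is a minimal sketch of how a tree of evaluation scenarios could be searched with UCT selection, where harder scenarios that the model fails receive higher reward. The `Node` class, `expand_state` scenario generator, and `evaluate_llm` scorer are hypothetical placeholders for illustration, not the paper's actual implementation or API.

```python
import math
import random

# Minimal sketch of MCTS-based benchmark exploration, loosely following the
# abstract's description (tree-based state representation + MCTS to surface
# challenging evaluation scenarios). All names below are illustrative
# placeholders, not Prism's real code.

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # an evaluation scenario (e.g., task + difficulty)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0        # accumulated reward; harder scenarios score higher

    def uct(self, c=1.4):
        # Standard UCT score balancing exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def expand_state(state):
    # Hypothetical scenario generator: derive harder variants of a task.
    return [f"{state}+constraint{i}" for i in range(2)]

def evaluate_llm(state):
    # Hypothetical evaluation signal: close to 1.0 if the model struggles.
    # In Prism this role is played by the multi-agent evaluation pipeline.
    return random.random()

def mcts(root_state, iterations=100):
    root = Node(root_state)
    for _ in range(iterations):
        # 1) Selection: descend via UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # 2) Expansion: grow the tree with harder scenario variants.
        if node.visits > 0:
            node.children = [Node(s, parent=node) for s in expand_state(node.state)]
            node = node.children[0]
        # 3) Simulation: score how challenging the scenario is for the model.
        reward = evaluate_llm(node.state)
        # 4) Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root

if __name__ == "__main__":
    tree = mcts("base_coding_task", iterations=50)
    print(f"root visited {tree.visits} times, mean difficulty signal {tree.value / tree.visits:.2f}")
```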
Related papers
- MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [17.432401371613903]
We propose a resource-efficient, System-2 thinking framework for code correctness evaluation.
MCTS-Judge uses Monte Carlo Tree Search to decompose problems into simpler, multi-perspective evaluations.
A high-precision, unit-test-level reward mechanism encourages the Large Language Model to perform line-by-line analysis.
arXiv Detail & Related papers (2025-02-18T02:55:48Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process. We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning. TestAgent enables automatic dynamic benchmark generation and in-depth assessment across diverse vertical domain scenarios.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z) - UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs [74.1976921342982]
This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency.
The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow.
arXiv Detail & Related papers (2024-04-11T09:17:12Z) - Towards a Robust Retrieval-Based Summarization System [11.747998334533776]
This paper describes an investigation of the robustness of large language models (LLMs) for retrieval augmented generation (RAG)-based summarization tasks.
Our first contribution is LogicSumm, an innovative evaluation framework incorporating realistic scenarios.
Based on limitations identified by LogicSumm, we then developed SummRAG, a comprehensive system to create training dialogues and fine-tune a model to enhance robustness.
arXiv Detail & Related papers (2024-03-29T00:14:46Z) - Dynamic Evaluation of Large Language Models by Meta Probing Agents [44.20074234421295]
We propose meta probing agents (MPA) to evaluate large language models (LLMs).
MPA is the key component of DyVal 2, which naturally extends the previous DyVal.
MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory.
arXiv Detail & Related papers (2024-02-21T06:46:34Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM
Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential
Reasoning Ability [29.1826948551409]
AQA-Bench is a novel benchmark to assess the sequential reasoning capabilities of large language models.
We build AQA-Bench with three different algorithms, namely binary search, depth-first search, and breadth-first search.
Our investigations reveal several interesting findings.
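As a rough illustration of how such an algorithm-derived interactive benchmark can be constructed, the sketch below procedurally generates a binary-search episode in which a model-provided guess function receives "higher"/"lower" feedback after each step. The function and agent names (`run_binary_search_episode`, `midpoint_agent`) are hypothetical, introduced only for illustration and not taken from AQA-Bench.

```python
import random

# Hedged sketch (not AQA-Bench's actual code): an interactive binary-search
# episode where an agent must locate a hidden value through sequential guesses.

def run_binary_search_episode(guess_fn, low=0, high=127, seed=None):
    rng = random.Random(seed)
    target = rng.randint(low, high)
    steps = 0
    feedback = None
    while True:
        guess = guess_fn(low, high, feedback)   # the model's next guess
        steps += 1
        if guess == target:
            return steps                        # fewer steps = stronger sequential reasoning
        feedback = "higher" if guess < target else "lower"
        if guess < target:
            low = guess + 1
        else:
            high = guess - 1

def midpoint_agent(low, high, feedback):
    # Reference agent that plays optimally (midpoint guessing), useful as an upper bound.
    return (low + high) // 2

if __name__ == "__main__":
    print(run_binary_search_episode(midpoint_agent, seed=0))
```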
arXiv Detail & Related papers (2024-02-14T18:59:33Z) - DRDT: Dynamic Reflection with Divergent Thinking for LLM-based
Sequential Recommendation [53.62727171363384]
We introduce a novel reasoning principle: Dynamic Reflection with Divergent Thinking.
Our methodology is dynamic reflection, a process that emulates human learning through probing, critiquing, and reflecting.
We evaluate our approach on three datasets using six pre-trained LLMs.
arXiv Detail & Related papers (2023-12-18T16:41:22Z)