Related papers: Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

Related papers

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents [8.760287445955045]
Large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production.<n>Prior agentic benchmarks rely on fully deterministic backends, which are costly to build and iterate.<n>We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database.
arXiv Detail & Related papers (2026-02-18T07:49:47Z)
AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems [3.099103925863002]
This paper introduces AgentDrive, an open benchmark dataset containing 300,000 driving scenarios.<n>AgentDrive formalizes a factorized scenario space across seven axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density.<n>To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions.
arXiv Detail & Related papers (2026-01-23T18:33:41Z)
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions.<n>We propose PULSE, a framework for more efficient human-centric evaluation of agent designs.<n>We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z)
AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition [27.312190686305588]
Large language models (LLMs) have shown strong potential in automated scoring.<n>Their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment.<n>We propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition.
arXiv Detail & Related papers (2025-09-26T05:45:14Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs [24.403284945948272]
AutoJudger is an agent-driven framework for efficient and adaptive benchmarking of multimodal large language models.<n>AutoJudger employs the Item Response Theory (IRT) to estimate the question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions.
arXiv Detail & Related papers (2025-05-27T16:17:15Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization [80.09112808413133]
Mujica is a planner that decomposes questions into acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning.<n>MyGO is a novel reinforcement learning method that replaces traditional policy updates with gradient Likelihood Maximum Estimation.<n> Empirical results across multiple datasets demonstrate the effectiveness of MujicaMyGO in enhancing multi-hop QA performance.
arXiv Detail & Related papers (2025-05-20T18:33:03Z)
Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games [3.725822359130832]
Large Language Models (LLMs) are increasingly being explored as evaluators in serious games.<n>This study investigates the reliability of five small-scale LLMs when assessing player responses in textitEn-join, a game that simulates decision-making within energy communities.<n>Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance.
arXiv Detail & Related papers (2025-04-13T10:46:13Z)
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving.<n>Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning.<n>Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.<n>Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.<n>Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results [55.33807002543901]
We present AIvaluateXR, a comprehensive evaluation framework for benchmarking large language models (LLMs) running on XR devices.<n>We deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation.<n>We propose a unified evaluation method based on the 3D Optimality theory to select the optimal device-model pairs from quality and speed objectives.
arXiv Detail & Related papers (2025-02-13T20:55:48Z)
MetaOpenFOAM: an LLM-based multi-agent framework for CFD [11.508919041921942]
MetaOpenFOAM is a novel multi-agent collaborations framework. It aims to complete CFD simulation tasks with only natural language as input. It harnesses the power of MetaGPT's assembly line paradigm.
arXiv Detail & Related papers (2024-07-31T04:01:08Z)
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Applied metamodelling for ATM performance simulations [0.0]
XALM (eXplainable Active Learning Metamodel) is a framework integrating active learning and SHAP values into simulation metamodels. XALM efficiently uncovers hidden relationships among input and output variables in ATM simulators. Case study shows XALM's effectiveness in enhancing simulation interpretability and understanding variable interactions.
arXiv Detail & Related papers (2023-08-07T08:46:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.