Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
- URL: http://arxiv.org/abs/2510.20886v1
- Date: Thu, 23 Oct 2025 17:57:28 GMT
- Title: Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
- Authors: Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum
- Abstract summary: Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling).
- Score: 81.63702981397408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who? where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.
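The abstract reports gains in expected information gain (EIG) measured in bits. As a rough illustration of what that quantity means for a yes/no question, the sketch below estimates EIG from a set of sampled hypotheses, assuming the answer is flipped with a fixed probability `noise`. The function names, the toy hypothesis set, and the symmetric-noise model are assumptions made for this example; the paper's actual Monte Carlo procedure is not described in this listing.

```python
import math
from typing import Callable, Sequence, TypeVar

H = TypeVar("H")


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def expected_information_gain(
    hypotheses: Sequence[H],
    question: Callable[[H], bool],
    noise: float = 0.0,
) -> float:
    """Monte Carlo EIG (bits) of a yes/no question over equally weighted samples.

    Assumption: the answerer reports the wrong answer with probability `noise`,
    so EIG = H_b(noise + (1 - 2 * noise) * p_yes) - H_b(noise).
    """
    if not hypotheses:
        raise ValueError("need at least one sampled hypothesis")
    # Fraction of sampled hypotheses under which the true answer is "yes".
    p_yes = sum(1 for h in hypotheses if question(h)) / len(hypotheses)
    # Probability of actually hearing "yes" once answer noise is applied.
    p_observed_yes = noise + (1 - 2 * noise) * p_yes
    return binary_entropy(p_observed_yes) - binary_entropy(noise)


# Toy example (hypothetical): a ship hidden in one of four cells.
positions = [0, 1, 2, 3]
print(expected_information_gain(positions, lambda h: h < 2))    # even split
print(expected_information_gain(positions, lambda h: h == 0))   # lopsided split
print(expected_information_gain(positions, lambda h: h < 2, noise=0.1))
```

Under this noise model a question that splits the hypotheses evenly is the most informative, and its value is capped at `1 - binary_entropy(noise)` bits, which is one way to read a noise ceiling on EIG.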
Related papers
- AgentIR: Reasoning-Aware Retrieval for Deep Research Agents [76.29382561831105]
Deep Research agents generate explicit natural language reasoning before each search call. Reasoning-Aware Retrieval embeds the agent's reasoning trace alongside its query. DR-Synth generates Deep Research retriever training data from standard QA datasets. AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch.
arXiv Detail & Related papers (2026-03-04T18:47:26Z)
- ResearchGym: Evaluating Language Model Agents on Real-World AI Research [48.46915933681714]
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap.
arXiv Detail & Related papers (2026-02-16T19:00:03Z)
- Cybersecurity AI: A Game-Theoretic AI for Guiding Attack and Defense [1.0933254855925085]
Generative Cut-the-Rope (G-CTR) is a game-theoretic guidance layer that extracts attack graphs from an agent's context. In five real-world exercises, G-CTR matches 70-90% of expert graph structure while running 60-245x faster and over 140x cheaper than manual analysis.
arXiv Detail & Related papers (2026-01-09T16:06:10Z)
- Emergence: Overcoming Privileged Information Bias in Asymmetric Embodied Agents via Active Querying [0.0]
Large Language Models (LLMs) act as powerful reasoning engines but struggle with "symbol grounding" in embodied environments. We investigate the Privileged Information Bias (or "Curse of Knowledge"), where a knowledgeable "Leader" agent fails to guide a sensor-limited "Follower" due to a lack of Theory of Mind. Our experiments reveal a significant "Success Gap": while the Leader successfully perceives the target in 35.0% of episodes, the collaborative team succeeds only 17.0% of the time, implying that nearly 50% of feasible plans fail solely due to communicative grounding errors.
arXiv Detail & Related papers (2025-12-13T17:17:51Z)
- RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z)
- AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks. We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
- Scheming Ability in LLM-to-LLM Strategic Interactions [4.873362301533824]
Large language model (LLM) agents are deployed autonomously in diverse contexts. We investigate the scheming ability and propensity of frontier LLM agents through two game-theoretic frameworks. We test four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b).
arXiv Detail & Related papers (2025-10-11T04:42:29Z)
- From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs [58.02809208460186]
We revisit this paradox using high-quality traces from DeepSeek-R1 as demonstrations. We find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. We introduce Insight-to-solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights.
arXiv Detail & Related papers (2025-09-27T08:59:31Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- Agentic UAVs: LLM-Driven Autonomy with Integrated Tool-Calling and Cognitive Reasoning [3.4643961367503575]
Existing UAV frameworks lack context-aware reasoning, autonomous decision-making, and ecosystem-level integration. This paper introduces the Agentic UAVs framework, a five-layer architecture (Perception, Reasoning, Action, Integration, Learning). A ROS2 and Gazebo-based prototype integrates YOLOv11 object detection with GPT-4 reasoning and local Gemma-3 deployment.
arXiv Detail & Related papers (2025-09-14T08:46:40Z)
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents [93.26456498576181]
This paper focuses on the development of native Autonomous Single-Agent models for Deep Research. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark.
arXiv Detail & Related papers (2025-09-08T02:07:09Z)
- Fact or Facsimile? Evaluating the Factual Robustness of Modern Retrievers [34.31192184496381]
Dense retrievers and rerankers are central to retrieval-augmented generation (RAG) pipelines. We evaluate how much factual competence these components inherit or lose from large language models (LLMs) they are based on. For every embedding model, cosine-similarity scores between queries and correct completions are significantly higher than those for incorrect ones.
arXiv Detail & Related papers (2025-08-28T04:13:51Z)
- Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data [46.65903742010956]
We present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs achieve only 11.86% accuracy in generating human actions. We also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance.
arXiv Detail & Related papers (2025-03-26T17:33:27Z)
- Better Zero-Shot Reasoning with Role-Play Prompting [10.90357246745529]
Role-play prompting consistently surpasses the standard zero-shot approach across most datasets.
This highlights its potential to augment the reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-08-15T11:08:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.