AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
- URL: http://arxiv.org/abs/2601.11354v1
- Date: Fri, 16 Jan 2026 15:02:41 GMT
- Title: AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
- Authors: Weiyi Wang, Xinchi Chen, Jingjing Gong, Xuanjing Huang, Xipeng Qiu
- Abstract summary: We introduce AstroReason-Bench, a benchmark for evaluating agentic planning in Space Planning Problems (SPP). AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. We find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints.
- Score: 71.89040853616602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.
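The abstract describes a unified agent-oriented interaction protocol over heterogeneous scheduling regimes but does not specify it here. As an illustration only, the sketch below shows one minimal shape such an observe/step loop could take for a single-resource scheduling task; all class and function names (`Task`, `SchedulingEnv`, `greedy_agent`) are hypothetical stand-ins, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    """One scheduling request, e.g. a ground-station contact (hypothetical model)."""
    name: str
    start: int     # earliest feasible start step
    end: int       # latest feasible end step
    duration: int  # steps of exclusive resource use

class SchedulingEnv:
    """Toy single-resource environment exposing an observe/step protocol."""

    def __init__(self, tasks):
        self.pending = {t.name: t for t in tasks}
        self.schedule = []  # accepted (Task, start_time) pairs

    def observe(self):
        """Return pending task names and the schedule accepted so far."""
        return sorted(self.pending), [(t.name, s) for t, s in self.schedule]

    def step(self, task_name, start_time):
        """Try to book a task; enforce window and no-overlap constraints."""
        t = self.pending.get(task_name)
        if t is None:
            return False, "unknown or already scheduled"
        if start_time < t.start or start_time + t.duration > t.end:
            return False, "outside visibility window"
        for other, s in self.schedule:
            if start_time < s + other.duration and s < start_time + t.duration:
                return False, "overlaps an accepted task"
        del self.pending[task_name]
        self.schedule.append((t, start_time))
        return True, "accepted"

def greedy_agent(env, horizon=100):
    """A trivial baseline policy: book each pending task at its first feasible start."""
    for name in list(env.observe()[0]):
        for s in range(horizon):
            ok, _ = env.step(name, s)
            if ok:
                break

tasks = [Task("sat-A", 0, 10, 4), Task("sat-B", 2, 8, 3)]
env = SchedulingEnv(tasks)
greedy_agent(env)
```

An LLM agent would replace `greedy_agent`, reading `observe()` output in text form and emitting `step()` actions; the gap the paper reports is between such policies and specialized solvers under these constraints.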
Related papers
- Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for STVG in an open-world, training-free scenario. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), are constructed by leveraging modern Multimodal Large Language Models (MLLMs). Experiments on popular benchmarks demonstrate the superiority of the proposed approach, which outperforms existing weakly-supervised and zero-shot approaches by a clear margin.
arXiv Detail & Related papers (2026-02-10T10:16:27Z) - TodoEvolve: Learning to Architect Agent Planning Systems [68.48983335970901]
TodoEvolve is a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning systems. PlanFactory provides a common interface for heterogeneous planning patterns. TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.
arXiv Detail & Related papers (2026-02-08T06:37:01Z) - Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models [1.8690248464957548]
Mil-SCORE is a scenario-level dataset of expert-authored, multi-hop questions grounded in a simulated military planning scenario. The benchmark includes a diverse set of question types across seven categories, targeting both factual recall and multi-step reasoning. Our findings highlight substantial headroom on Mil-SCORE, indicating that current systems struggle with realistic, scenario-level long-context planning.
arXiv Detail & Related papers (2026-01-29T15:11:31Z) - Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism [61.01709143437043]
We introduce a novel agent design framework centered on a Hierarchical Task Abstraction Mechanism (HTAM). Specifically, HTAM moves beyond emulating social roles, instead structuring multi-agent systems into a logical hierarchy that mirrors the intrinsic task-dependency graph of a given domain. We instantiate this framework as EarthAgent, a multi-agent system tailored for complex geospatial analysis.
arXiv Detail & Related papers (2025-11-21T12:25:47Z) - Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents [49.3216026940601]
Earth observation (EO) is essential for understanding the states of the Earth system. Recent MLLMs have advanced EO research, but they still lack the capability to tackle complex tasks that require multi-step reasoning. We introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem.
arXiv Detail & Related papers (2025-09-27T06:04:28Z) - HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds [0.0]
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds.
arXiv Detail & Related papers (2025-08-18T09:59:02Z) - CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale [4.464959191643012]
We introduce CREW-Wildfire, an open-source benchmark designed to evaluate next-generation multi-agent Agentic AI frameworks. CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, dynamics, and long-horizon planning objectives. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps.
arXiv Detail & Related papers (2025-07-07T16:33:42Z) - Benchmarking LLMs' Swarm intelligence [51.648605206159125]
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) remains largely unexplored. We introduce SwarmBench, a novel benchmark designed to systematically evaluate LLMs acting as decentralized agents on coordination tasks. We propose metrics for coordination effectiveness and analyze emergent group dynamics.
arXiv Detail & Related papers (2025-05-07T12:32:01Z) - Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification [5.727096041675994]
Large Language Models (LLMs) have shown promise as robotic planners but often struggle with long-horizon and complex tasks. We propose a neuro-symbolic approach that enhances LLM-based planners with Knowledge Graph-based RAG for hierarchical plan generation.
arXiv Detail & Related papers (2025-04-06T18:36:30Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [38.89166693142495]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs). It features a controllable and diverse set of embodied tasks varying in difficulty and complexity. Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - AI planning in the imagination: High-level planning on learned abstract search spaces [68.75684174531962]
We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training.
We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman.
arXiv Detail & Related papers (2023-08-16T22:47:16Z)