Related papers: Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12

Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12

URL: http://arxiv.org/abs/2602.03630v1
Date: Tue, 03 Feb 2026 15:18:26 GMT
Title: Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12
Authors: Iñaki del Campo, Pablo Cuervo, Victor Rodriguez-Fernandez, Roberto Armellin, Jack Yarndley,
Abstract summary: Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning.<n>This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12)<n>We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions.
Score: 0.1710384116816033
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning, yet their capacity for autonomous multi-stage planning in high-dimensional, physically constrained environments remains an open research question. This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12), a complex astrodynamics challenge requiring the design of a large-scale asteroid mining campaign. We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions. To assess performance beyond binary validity, we employ an "LLM-as-a-Judge" methodology, utilizing a rubric developed by domain experts to evaluate strategic viability across five structural categories. A comparative analysis of models, ranging from GPT-4-Turbo to reasoning-enhanced architectures like Gemini 2.5 Pro, and o3, reveals a significant trend: the average strategic viability score has nearly doubled in the last two years (rising from 9.3 to 17.2 out of 26). However, we identify a critical capability gap between strategy and execution. While advanced models demonstrate sophisticated conceptual understanding, correctly framing objective functions and mission architectures, they consistently fail at implementation due to physical unit inconsistencies, boundary condition errors, and inefficient debugging loops. We conclude that, while current LLMs often demonstrate sufficient knowledge and intelligence to tackle space science tasks, they remain limited by an implementation barrier, functioning as powerful domain facilitators rather than fully autonomous engineers.

Related papers

EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection.<n>However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight.<n>We introduce textscEmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z)
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering [59.18634614089481]
We present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE)<n>By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC)<n>HCC allows agents to decouple immediate execution from long-term experimental strategy.<n>In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%.
arXiv Detail & Related papers (2026-01-15T13:52:04Z)
The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments [0.11586753333439907]
We present an empirical study evaluating frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge.<n>Our analysis reveals an empirically-derived emphhierarchy of agentic capabilities that models must master for real-world deployment.<n>Weaker models struggle with fundamental tool use and planning, whereas stronger models primarily fail on tasks requiring contextual inference beyond explicit instructions.
arXiv Detail & Related papers (2026-01-13T23:49:06Z)
PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models [51.43746425777865]
Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks.<n>We propose PILOT, a framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance.
arXiv Detail & Related papers (2026-01-07T12:38:56Z)
Holistic Evaluation of Multimodal LLMs on Spatial Intelligence [81.2547965083228]
We propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence.<n>We conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens.<n>Our empirical study then reveals that GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks.
arXiv Detail & Related papers (2025-08-18T17:55:17Z)
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds [0.0]
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming.<n>But their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored.<n>We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds.
arXiv Detail & Related papers (2025-08-18T09:59:02Z)
LLM4CMO: Large Language Model-aided Algorithm Design for Constrained Multiobjective Optimization [54.35609820607923]
Large language models (LLMs) offer new opportunities for assisting with algorithm design.<n>We propose LLM4CMO, a novel CMOEA based on a dual-population, two-stage framework.<n>LLMs can serve as efficient co-designers in the development of complex evolutionary optimization algorithms.
arXiv Detail & Related papers (2025-08-16T02:00:57Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.