Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution
- URL: http://arxiv.org/abs/2511.05540v2
- Date: Tue, 11 Nov 2025 18:17:53 GMT
- Title: Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution
- Authors: Shiyao Sang
- Abstract summary: We show that effective planning arises from the co-evolution of belief and intent within a minimal set of semantically rich tokens. Our work establishes a new paradigm: intelligence lies not in pixel fidelity, but in the tokenized duality of belief and intent.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Inspired by cognitive science, we propose that effective planning arises not from reconstructing the world, but from the co-evolution of belief and intent within a minimal set of semantically rich tokens. Experiments on the nuPlan benchmark (720 scenarios, 11k+ samples) reveal three principles: (1) sparse intent tokens alone achieve 0.487 m ADE, demonstrating strong performance without future prediction; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.382 m, a 21.6% improvement, showing that performance emerges from cognitive planning; and (3) explicit reconstruction loss degrades performance, confirming that task-driven belief-intent co-evolution suffices under reliable perception inputs. Crucially, we observe the emergence of cognitive consistency: through prolonged training, the model spontaneously develops stable token dynamics that balance current perception (belief) and future goals (intent). This process, accompanied by "temporal fuzziness," enables robustness under uncertainty and continuous self-optimization. Our work establishes a new paradigm: intelligence lies not in pixel fidelity, but in the tokenized duality of belief and intent. By reframing planning as understanding rather than reaction, TIWM bridges the gap between world models and VLA systems, paving the way for foresightful agents that plan through imagination. Note: Numerical comparisons with methods reporting results on nuScenes are indicative only, as nuPlan presents a more challenging planning-focused evaluation.
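The ADE figures quoted in the abstract can be sanity-checked with a few lines of Python. The sketch below assumes the common definition of average displacement error (mean Euclidean distance between matched predicted and reference waypoints); the abstract does not state the paper's exact horizon or sampling protocol, so `average_displacement_error`, `relative_improvement`, and the toy trajectories are illustrative names and data, not the authors' code.

```python
import math


def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between matched waypoints (same units as inputs)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)


def relative_improvement(baseline, improved):
    """Fractional reduction of an error metric relative to the baseline value."""
    return (baseline - improved) / baseline


# Toy trajectories (hypothetical data, metres): per-waypoint errors are 0.0, 0.5, 1.0.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
toy_ade = average_displacement_error(predicted, reference)

# ADE values reported in the abstract: 0.487 m (intent tokens only)
# and 0.382 m (decoding conditioned on predicted future tokens).
improvement = relative_improvement(0.487, 0.382)
print(f"toy ADE: {toy_ade:.3f} m")
print(f"relative improvement: {improvement:.1%}")
```

The relative improvement computed from the two reported values rounds to the 21.6% stated in the abstract.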
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination.
SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z)
- Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents [49.119608399413806]
Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks.
This paper introduces Cog, a framework that trains agents to dynamically adapt cognitive depth at each step.
Experiments on ALFWorld and ScienceWorld demonstrate that Cog achieves state-of-the-art performance with superior efficiency.
arXiv Detail & Related papers (2026-02-13T06:52:09Z)
- TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space? [11.222572150508332]
Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback.
However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning.
We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes.
arXiv Detail & Related papers (2026-02-05T11:49:30Z)
- Active Intelligence in Video Avatars via Closed-loop World Modeling [55.29966567726842]
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency.
We introduce L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in generative environments.
We also present ORCA, the first framework enabling active intelligence in video avatars.
arXiv Detail & Related papers (2025-12-23T18:59:16Z)
- Metacognitive Sensitivity for Test-Time Dynamic Model Selection [0.0]
We propose a new framework for evaluating and leveraging AI metacognition.
We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy.
We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection.
arXiv Detail & Related papers (2025-12-11T09:15:05Z)
- A Neuro-Symbolic Framework for Reasoning under Perceptual Uncertainty: Bridging Continuous Perception and Discrete Symbolic Planning [1.9236465591431287]
We present a neuro-symbolic framework that explicitly models and propagates uncertainty from perception to planning.
We demonstrate the framework's effectiveness on tabletop robotic manipulation as a concrete application.
arXiv Detail & Related papers (2025-11-18T14:38:01Z)
- Unleashing Perception-Time Scaling to Multimodal Reasoning Models [60.578179197783754]
Recent advances in inference-time scaling have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs).
Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear.
We propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems.
arXiv Detail & Related papers (2025-10-10T03:17:52Z)
- Discrete JEPA: Learning Discrete Token Representations without Reconstruction [23.6286989806018]
The symbolic cornerstone of cognitive intelligence lies in extracting hidden patterns from observations.
We propose Discrete-JEPA, extending the latent predictive coding framework with semantic tokenization.
Our approach promises a significant impact for advancing world modeling and planning capabilities in artificial intelligence systems.
arXiv Detail & Related papers (2025-06-17T10:15:17Z)
- Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework [54.40508478482667]
We present a comprehensive framework to disentangle, quantify, and mitigate uncertainty in perception and plan generation.
We propose methods tailored to the unique properties of perception and decision-making.
We show that our uncertainty disentanglement framework reduces variability by up to 40% and enhances task success rates by 5% compared to baselines.
arXiv Detail & Related papers (2024-11-03T17:32:00Z)
- Uncertainty-boosted Robust Video Activity Anticipation [72.14155465769201]
Video activity anticipation aims to predict what will happen in the future, embracing a broad application prospect ranging from robot vision to autonomous driving.
Despite the recent progress, the data uncertainty issue, reflected as the content evolution process and dynamic correlation in event labels, has been somehow ignored.
We propose an uncertainty-boosted robust video activity anticipation framework, which generates uncertainty values to indicate the credibility of the anticipation results.
arXiv Detail & Related papers (2024-04-29T12:31:38Z)
- ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More [7.797154022794006]
We propose ExACT, a novel approach that tackles event-based action recognition from a cross-modal conceptualizing perspective.
Experiments show that our ExACT achieves superior recognition accuracy of 94.83% (+2.23%), 90.10% (+37.47%), and 67.24% on the PAF, HARDVS, and our SeAct datasets, respectively.
arXiv Detail & Related papers (2024-03-19T08:15:53Z)
- Understanding Self-Predictive Learning for Reinforcement Learning [61.62067048348786]
We study the learning dynamics of self-predictive learning for reinforcement learning.
We propose a novel self-predictive algorithm that learns two representations simultaneously.
arXiv Detail & Related papers (2022-12-06T20:43:37Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.