Accelerating Multi-modal LLM Gaming Performance via Input Prediction and Mishit Correction
- URL: http://arxiv.org/abs/2512.17250v1
- Date: Fri, 19 Dec 2025 05:34:52 GMT
- Title: Accelerating Multi-modal LLM Gaming Performance via Input Prediction and Mishit Correction
- Authors: Ziyang Lin, Zixuan Sun, Sanhorn Chen, Xiaoyang Chen, Roy Zhao
- Abstract summary: Real-time sequential control agents are often bottlenecked by inference latency. We propose a framework that adapts the predict-then-verify philosophy of speculative execution to model-based control with TD-MPC2. We show that our method reduces the number of planning inferences from 500 to 282, improves end-to-end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction.
- Score: 4.323124094061299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time sequential control agents are often bottlenecked by inference latency. Even modest per-step planning delays can destabilize control and degrade overall performance. We propose a speculation-and-correction framework that adapts the predict-then-verify philosophy of speculative execution to model-based control with TD-MPC2. At each step, a pretrained world model and latent-space MPC planner generate a short-horizon action queue together with predicted latent rollouts, allowing the agent to execute multiple planned actions without immediate replanning. When a new observation arrives, the system measures the mismatch between the encoded real latent state and the queued predicted latent. For small to moderate mismatch, a lightweight learned corrector applies a residual update to the speculative action, distilled offline from a replanning teacher. For large mismatch, the agent safely falls back to full replanning and clears stale action queues. We study both a gated two-tower MLP corrector and a temporal Transformer corrector to address local errors and systematic drift. Experiments on the DMC Humanoid-Walk task show that our method reduces the number of planning inferences from 500 to 282, improves end-to-end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction. Ablation results demonstrate that speculative execution without correction is unreliable over longer horizons, highlighting the necessity of mismatch-aware correction for robust latency reduction.
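The abstract's speculate-verify-correct loop can be sketched as follows. This is a minimal illustration of the control flow only: the encoder, planner, corrector, and the two mismatch thresholds are hypothetical stand-ins, not the paper's actual TD-MPC2 implementation.

```python
import math

# Illustrative mismatch thresholds (assumed values, not from the paper).
LOW, HIGH = 0.1, 0.5


def encode(obs):
    """Stand-in for the world-model encoder: observation -> latent state."""
    return list(obs)


def plan(latent, horizon=4):
    """Stand-in for latent-space MPC planning: returns a short-horizon
    action queue together with the latent rollout it predicts."""
    actions = [[0.0, 0.0] for _ in range(horizon)]
    predicted = [list(latent) for _ in range(horizon)]
    return actions, predicted


def correct(action, z_real, z_pred):
    """Stand-in for the lightweight learned corrector, which in the paper
    is distilled offline from a replanning teacher."""
    return [a + 0.01 * (r - p) for a, r, p in zip(action, z_real, z_pred)]


def step_policy(obs, queue):
    """One control step: execute speculatively, correct, or replan,
    depending on the mismatch between real and predicted latents."""
    z = encode(obs)
    if not queue:                          # queue exhausted: replan
        queue.extend(zip(*plan(z)))
    action, z_pred = queue.pop(0)
    mismatch = math.sqrt(sum((r - p) ** 2 for r, p in zip(z, z_pred)))
    if mismatch >= HIGH:                   # large drift: drop stale queue, full replan
        queue.clear()
        queue.extend(zip(*plan(z)))
        action, _ = queue.pop(0)
    elif mismatch >= LOW:                  # moderate drift: residual correction
        action = correct(action, z, z_pred)
    return action                          # small drift: execute speculative action
```

In this sketch a full replan is only triggered when the queue runs dry or the latent mismatch exceeds the high threshold, which is how the method amortizes planner calls across multiple control steps.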
Related papers
- STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction [16.465783114087223]
Iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems. We propose STEP, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions. STEP with 2 steps can achieve an average 21.6% and 27.5% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks.
arXiv Detail & Related papers (2026-02-09T03:50:40Z)
- DLLM Agent: See Farther, Run Faster [94.74432470237817]
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties. We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow. We find that DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup.
arXiv Detail & Related papers (2026-02-07T09:01:18Z)
- ZIP-RC: Optimizing Test-Time Compute via Zero-Overhead Joint Reward-Cost Prediction [57.799425838564]
We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost.
arXiv Detail & Related papers (2025-12-01T09:44:31Z)
- The Hidden Cost of Approximation in Online Mirror Descent [56.99972253009168]
Online mirror descent (OMD) is a fundamental algorithmic paradigm that underlies many algorithms in optimization, machine learning and sequential decision-making. In this work we initiate a systematic study into inexact OMD, and uncover an intricate relation between regularizer smoothness and robustness to approximation errors.
arXiv Detail & Related papers (2025-11-27T10:09:07Z)
- Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design [35.95362310928356]
LLM-based search agents achieve strong performance but suffer from severe latency. We revisit this bottleneck through the lens of speculation. We present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency.
arXiv Detail & Related papers (2025-11-25T08:15:17Z)
- DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving [20.235153433297384]
Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
arXiv Detail & Related papers (2025-11-25T07:00:26Z)
- Algorithms for dynamic scheduling in manufacturing, towards digital factories: Improving Deadline Feasibility and Responsiveness via Temporal Networks [0.0]
Traditional deterministic schedules break down when reality deviates from nominal plans. This thesis combines offline constraint-programming with online temporal-network execution to create schedules that remain feasible under worst-case uncertainty.
arXiv Detail & Related papers (2025-10-16T17:28:25Z)
- Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
- Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference [29.19884207604898]
Large Language Model (LLM) inference has emerged as a fundamental paradigm. In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads.
arXiv Detail & Related papers (2025-10-15T15:29:08Z)
- Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z)
- Centaur: Robust End-to-End Autonomous Driving with Test-Time Training [84.78837437133234]
We propose Centaur, which updates a planner's behavior via test-time training without relying on hand-engineered rules or cost functions. We develop a novel uncertainty measure, called Cluster Entropy, which is simple, interpretable, and compatible with state-of-the-art planning algorithms.
arXiv Detail & Related papers (2025-03-14T17:59:41Z)
- AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments [24.606016498430407]
This paper presents AdaShadow, a responsive test-time adaptation framework for non-stationary mobile data distribution and resource dynamics.
AdaShadow addresses challenges in estimating layer importance and latency, as well as scheduling the optimal layer update plan.
Results show that AdaShadow achieves the best accuracy-latency balance under continual shifts.
arXiv Detail & Related papers (2024-10-10T16:41:39Z)
- Low-Precision Reinforcement Learning [63.930246183244705]
Low-precision training has become a popular approach to reduce computation time, memory footprint, and energy consumption in supervised learning.
In this paper we consider continuous control with the state-of-the-art SAC agent and demonstrate that a naïve adaptation of low-precision methods from supervised learning fails.
arXiv Detail & Related papers (2021-02-26T16:16:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.