Active Intelligence in Video Avatars via Closed-loop World Modeling
- URL: http://arxiv.org/abs/2512.20615v1
- Date: Tue, 23 Dec 2025 18:59:16 GMT
- Title: Active Intelligence in Video Avatars via Closed-loop World Modeling
- Authors: Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen
- Abstract summary: Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency. We introduce L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in generative environments. We also present ORCA, the first framework enabling active intelligence in video avatars.
- Score: 55.29966567726842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
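To make the OTAR cycle concrete, here is a minimal Python sketch of one Observe-Think-Act-Reflect loop with discrete belief updating over scene states, in the spirit of the POMDP formulation above. All names, the toy caption format, and the 0.8 caption-following probability are illustrative assumptions, not ORCA's actual implementation.

```python
# Toy closed-loop OTAR (Observe-Think-Act-Reflect) cycle with a discrete
# belief over scene states; everything here is an illustrative assumption.
import random

class OTARAgent:
    def __init__(self, states, goal):
        self.goal = goal
        self.belief = {s: 1.0 / len(states) for s in states}  # uniform prior

    def think(self):
        # System 2 (strategic): take the most probable current state and
        # predict the state the next action should reach (here: the goal).
        current = max(self.belief, key=self.belief.get)
        return current, self.goal

    def act(self, predicted):
        # System 1 (tactical): turn the abstract plan into a concrete
        # action caption for the video generator.
        return f"walk to the {predicted}"

    def reflect(self, predicted, observed, noise=0.2):
        # Verify the generated outcome against the prediction and update
        # the belief; `noise` models generative uncertainty.
        for s in self.belief:
            self.belief[s] *= (1.0 - noise) if s == observed else noise
        z = sum(self.belief.values())
        self.belief = {s: p / z for s, p in self.belief.items()}
        return observed == predicted

def video_model(caption, states):
    # Stand-in for the stochastic generator: usually follows the caption,
    # sometimes drifts to another state.
    target = caption.rsplit(" ", 1)[-1]
    return target if random.random() < 0.8 else random.choice(states)

states = ["kitchen", "hallway", "garden"]
agent = OTARAgent(states, goal="garden")
for step in range(5):  # Observe-Think-Act-Reflect
    _, predicted = agent.think()
    caption = agent.act(predicted)
    observed = video_model(caption, states)
    verified = agent.reflect(predicted, observed)
    print(step, caption, "->", observed, "ok" if verified else "replan")
```

The reflect step is what closes the loop: when the generator drifts, belief mass shifts toward the observed state and the next caption is planned from there rather than from the stale prediction.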
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z)
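The phrase "sparse imagination" suggests a refinement loop that consults an internal world model only occasionally rather than at every step. A minimal sketch under toy assumptions (additive dynamics, a proportional base policy; none of it from SC-VLA's code):

```python
# Self-correction via *sparse* imagination: mentally roll the proposed
# action forward only every few steps and refine it before executing.
import numpy as np

def imagine(state, action):
    return state + action                      # internal dynamics model (toy)

def base_policy(state, goal):
    return 0.5 * (goal - state)                # initial VLA-style proposal

goal, state = np.array([1.0, 2.0]), np.zeros(2)
for t in range(9):
    action = base_policy(state, goal)
    if t % 3 == 0:                             # sparse: imagine only sometimes
        predicted = imagine(state, action)     # mental rollout, no env step
        action += 0.5 * (goal - predicted)     # refine before executing
    state = imagine(state, action)             # execute (same toy dynamics)
    print(t, state.round(3))
```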
- Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z)
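Learning frame prediction and policy execution simultaneously amounts, at its simplest, to sharing one representation across two heads trained with a joint loss. The torch sketch below shows only that weight-sharing pattern; LingBot-VA's autoregressive diffusion machinery is omitted, and every name and dimension here is assumed:

```python
# One shared backbone, two objectives: next-frame prediction and action
# decoding, optimised jointly. Purely illustrative, not LingBot-VA's code.
import torch
import torch.nn as nn

class JointWorldPolicy(nn.Module):
    def __init__(self, frame_dim=64, act_dim=7, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim, hidden), nn.ReLU())
        self.frame_head = nn.Linear(hidden, frame_dim)   # world-model head
        self.action_head = nn.Linear(hidden, act_dim)    # policy head

    def forward(self, frame):
        h = self.encoder(frame)
        return self.frame_head(h), self.action_head(h)

model = JointWorldPolicy()
frames = torch.randn(8, 64)                              # dummy batch
next_frames, actions = torch.randn(8, 64), torch.randn(8, 7)
pred_frames, pred_actions = model(frames)
loss = nn.functional.mse_loss(pred_frames, next_frames) \
     + nn.functional.mse_loss(pred_actions, actions)
loss.backward()  # one joint gradient step updates both objectives
```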
- Act2Goal: From World Model To General Goal-conditioned Policy [14.222177107215648]
Act2Goal is a goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. We show that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction.
arXiv Detail & Related papers (2025-12-29T15:28:42Z)
- Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving [48.512353531499286]
We introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that implicitly integrates 2D/3D scene understanding abilities within a single vision-language model (VLM). We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on 2D detection and nuScenes BEV 3D detection.
arXiv Detail & Related papers (2025-11-24T15:28:25Z)
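IoU-aware scoring can be illustrated in a few lines: rank each detection by its class probability times a localisation-quality (IoU) term, so a confident but badly localised box loses to a moderately confident, well-localised one. In Percept-WAM the quality term is predicted by the model; the oracle IoU below is a stand-in:

```python
# IoU-aware scoring sketch: score = class probability * localisation quality.
def iou(a, b):
    # Boxes are (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

gt = (0, 0, 10, 10)
candidates = [((1, 1, 11, 11), 0.80),   # decent box, moderate confidence
              ((4, 4, 20, 20), 0.95)]   # poor box, high confidence
for box, cls_prob in candidates:
    print(box, "score:", round(cls_prob * iou(box, gt), 3))
# The well-localised 0.80-confidence box wins: 0.545 vs 0.107.
```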
- ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model [61.29164681694533]
ViPER is a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability.
arXiv Detail & Related papers (2025-10-28T10:42:57Z)
- Dyna-Mind: Learning to Simulate from Experience for Better AI Agents [62.21219817256246]
We argue that current AI agents need "vicarious trial and error" - the capacity to mentally simulate alternative futures before acting. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning.
arXiv Detail & Related papers (2025-10-10T17:30:18Z)
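In its simplest form, "vicarious trial and error" reduces to rolling candidate actions forward in an internal model and committing to the one with the best simulated future. A toy sketch with invented dynamics and reward (Dyna-Mind itself trains (V)LM agents to do this inside their reasoning traces):

```python
# Simulate-before-acting: evaluate each candidate by a short mental rollout.
import numpy as np

GOAL = np.array([5.0, 5.0])

def model(state, action):
    return state + action                       # internal simulator (toy)

def score(state):
    return -np.linalg.norm(state - GOAL)        # closer to the goal is better

def rollout(state, action, horizon):
    for _ in range(horizon):                    # mental simulation only
        state = model(state, action)
    return state

def choose(state, candidates, horizon=3):
    return max(candidates, key=lambda a: score(rollout(state, a, horizon)))

state = np.zeros(2)
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
for t in range(4):
    a = choose(state, candidates)
    state = model(state, a)                     # act for real
    print(t, "picked", a, "->", state)
```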
- AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving [37.260140808367716]
We propose AutoDrive-R$^2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems. We first propose an innovative CoT dataset named nuScenesR$^2$-6K for supervised fine-tuning. We then employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework to ensure reliable smoothness and realistic trajectory planning.
arXiv Detail & Related papers (2025-09-02T04:32:24Z)
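GRPO's defining step is a group-relative advantage: sample several rollouts for the same prompt, score each, and normalise the rewards within the group, so no learned value critic is needed. A compact sketch, with a made-up smoothness reward standing in for the paper's physics-grounded one:

```python
# Group-relative advantage as used in GRPO-style training (sketch).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)     # normalise within the group

def smoothness_reward(trajectory):
    # Penalise large second differences (a crude proxy for jerky motion).
    accel = np.diff(trajectory, n=2, axis=0)
    return -float(np.abs(accel).sum())

rng = np.random.default_rng(0)
group = [np.cumsum(rng.normal(size=(20, 2)), axis=0) for _ in range(8)]
rewards = [smoothness_reward(t) for t in group]
adv = group_relative_advantages(rewards)
print(np.round(adv, 3))  # positive entries reinforce smoother trajectories
```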
- ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving [14.486548540613791]
We introduce ViLaD, a novel Large Vision Language Diffusion framework for end-to-end autonomous driving. ViLaD enables parallel generation of entire driving decision sequences, significantly reducing computational latency. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed.
arXiv Detail & Related papers (2025-08-18T04:01:56Z)
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation [69.79110149746506]
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions. GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments.
arXiv Detail & Related papers (2025-08-13T07:05:17Z)
- Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation [52.6091162517921]
INSIGHT is a two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning.
arXiv Detail & Related papers (2025-08-03T12:52:27Z)
- Domain Generalization using Action Sequences for Egocentric Action Recognition [22.373604443667134]
Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. We propose a domain generalization approach for Egocentric Action Recognition. By leveraging action sequences, we aim to enhance the model's generalization ability across unseen environments.
arXiv Detail & Related papers (2025-06-21T11:33:08Z)
- EASE: Embodied Active Event Perception via Self-Supervised Energy Minimization [6.249768559720122]
Active event perception is essential for embodied intelligence in tasks such as human-AI collaboration, assistive robotics, and autonomous navigation. We propose EASE, a self-supervised framework that unifies representation learning and embodied control through free energy.
arXiv Detail & Related papers (2025-06-20T23:45:51Z)
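Unifying representation learning and control under a free-energy objective can be caricatured in one dimension: the agent attends to the view its model currently predicts worst and uses the same prediction error to update the model. The sketch below is entirely illustrative and far simpler than EASE's actual objective:

```python
# One scalar "free energy" (squared prediction error) drives both where the
# agent looks and how its predictive model learns. Purely illustrative.
import numpy as np

def true_signal(a):
    return np.sin(a)                           # hidden environment

def free_energy(a, w):
    return (true_signal(a) - w * a) ** 2       # squared prediction error

w = 0.0                                        # linear predictive model: w * a
views = np.linspace(-1.0, 1.0, 21)             # candidate gaze directions
for step in range(200):
    a = max(views, key=lambda v: free_energy(v, w))  # attend to surprise
    err = true_signal(a) - w * a
    w += 0.05 * err * a                        # descend the same energy
residual = float(sum(free_energy(v, w) for v in views))
print("learned w:", round(w, 3), "residual energy:", round(residual, 4))
```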
This list is automatically generated from the titles and abstracts of the papers on this site.