CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
- URL: http://arxiv.org/abs/2503.09527v1
- Date: Wed, 12 Mar 2025 16:42:26 GMT
- Title: CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
- Authors: Peng Chen, Pi Bu, Yingyao Wang, Xinyi Wang, Ziming Wang, Jie Guo, Yingxiu Zhao, Qi Zhu, Jun Song, Siran Yang, Jiamang Wang, Bo Zheng
- Abstract summary: We introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought sequences. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat.
- Score: 45.5522574590016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at https://combatvla.github.io/.
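The abstract describes a pipeline in which tracker-collected video-action pairs are serialized as action-of-thought (AoT) sequences and decoded at inference with a truncated AoT strategy. The snippet below is a minimal, hypothetical sketch of that two-stage idea only: the dataclass fields, the `<think>`/`<act>` markers, and the `generate` callable are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of AoT serialization and truncated-AoT decoding.
# All names and markers here are illustrative assumptions, not CombatVLA's code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AoTSample:
    frames: List[str]    # sampled video frames captured by the action tracker
    thought: str         # tactical reasoning that precedes the action
    actions: List[str]   # low-level game inputs, e.g. "dodge", "light_attack"

def to_aot_text(sample: AoTSample) -> str:
    """Serialize a video-action pair as reasoning followed by an action segment."""
    return f"<think>{sample.thought}</think><act>{' '.join(sample.actions)}</act>"

def truncated_aot_decode(generate: Callable[[str, str], str], prompt: str) -> str:
    """Skip decoding the full reasoning at inference time and jump straight to
    the action segment; this is the rough intuition behind the speed-up."""
    # `generate(prefix, stop)` stands in for any autoregressive decoder call.
    return generate(prompt + "<act>", "</act>")

# Example of the training-side formatting with a stand-in sample:
sample = AoTSample(frames=["frame_000.png"],
                   thought="The boss is winding up a heavy swing.",
                   actions=["dodge", "light_attack"])
print(to_aot_text(sample))
```

The point of the sketch is only the format: reasoning tokens appear in the training targets, but they can be truncated or skipped when the model is queried for an action under a real-time budget.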
Related papers
- NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks [37.03331507197761]
Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios.
These models typically suffer from high computational overhead due to their large sizes.
We propose NORA, a model designed to reduce computational overhead while maintaining strong task performance.
arXiv Detail & Related papers (2025-04-28T14:47:34Z) - PointVLA: Injecting the 3D World into Vision-Language-Action Models [10.758939578236582]
We propose PointVLA, a framework that enhances pre-trained vision-language-action models with point cloud inputs without requiring retraining.
Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block (a hedged sketch of this frozen-expert pattern appears after this list).
PointVLA outperforms state-of-the-art 2D imitation learning methods across both simulated and real-world robotic tasks.
arXiv Detail & Related papers (2025-03-10T16:32:41Z) - 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks.
VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations.
We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z) - CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios.
We present a new advanced VLA architecture derived from Vision-Language Models (VLMs).
We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z) - Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy [68.50785963043161]
GemBench is a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies.
We present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs.
3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
arXiv Detail & Related papers (2024-10-02T09:02:34Z) - Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case [20.14197375326218]
This research aims to provide new insights and directions for applying multimodal agents in complex action game environments.
We select an ARPG, "Black Myth: Wukong", as a research platform to explore the capability boundaries of existing vision language models.
We will release a human operation dataset containing recorded gameplay videos and operation logs, including mouse and keyboard actions.
arXiv Detail & Related papers (2024-09-19T16:30:25Z) - Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z) - Learning a Weakly-Supervised Video Actor-Action Segmentation Model with a Wise Selection [97.98805233539633]
We address weakly-supervised video actor-action segmentation (VAAS).
We propose a general Weakly-Supervised framework with a Wise Selection of training samples and model evaluation criterion (WS2).
WS2 achieves state-of-the-art performance on both weakly-supervised VOS and VAAS tasks.
arXiv Detail & Related papers (2020-03-29T21:15:18Z)
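Several entries above describe keeping a pre-trained backbone or action expert frozen and training only a small injected module (e.g., PointVLA's lightweight 3D-feature block). Below is a hedged PyTorch sketch of that general pattern; the class name, dimensions, and gating scheme are assumptions for illustration and do not reproduce any of the papers' actual architectures.

```python
# Hedged sketch of the frozen-expert + lightweight-injector pattern.
# Names, sizes, and the gating scheme are illustrative assumptions only.
import torch
import torch.nn as nn

class Point3DInjector(nn.Module):
    """Small trainable block that fuses 3D point features into 2D VLA features."""
    def __init__(self, dim_2d: int = 512, dim_3d: int = 256):
        super().__init__()
        self.project = nn.Linear(dim_3d, dim_2d)   # map point features to the 2D space
        self.gate = nn.Parameter(torch.zeros(1))   # start at zero: no 3D influence initially

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
        return feat_2d + self.gate * self.project(feat_3d)

# The pre-trained action expert (stood in for by a single linear layer here)
# stays frozen; only the injector trains, so no retraining of the base model.
action_expert = nn.Linear(512, 7).requires_grad_(False)
injector = Point3DInjector()

fused = injector(torch.randn(1, 512), torch.randn(1, 256))
action = action_expert(fused)
print(action.shape)  # torch.Size([1, 7])
```

Initializing the gate at zero is one common way such adapters preserve the frozen model's behavior at the start of fine-tuning; whether any of the listed papers do exactly this is not stated in the summaries above.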