CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
- URL: http://arxiv.org/abs/2503.09527v1
- Date: Wed, 12 Mar 2025 16:42:26 GMT
- Title: CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
- Authors: Peng Chen, Pi Bu, Yingyao Wang, Xinyi Wang, Ziming Wang, Jie Guo, Yingxiu Zhao, Qi Zhu, Jun Song, Siran Yang, Jiamang Wang, Bo Zheng
- Abstract summary: We introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought sequences. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat.
- Score: 45.5522574590016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at https://combatvla.github.io/.
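The abstract describes a pipeline in which tracker-collected video-action pairs are serialized as action-of-thought (AoT) sequences and decoded at inference with a truncated AoT strategy. The snippet below is a minimal, hypothetical sketch of that two-stage idea only: the dataclass fields, the `<think>`/`<act>` markers, and the `generate` callable are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of AoT serialization and truncated-AoT decoding.
# All names and markers here are illustrative assumptions, not CombatVLA's code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AoTSample:
    frames: List[str]    # sampled video frames captured by the action tracker
    thought: str         # tactical reasoning that precedes the action
    actions: List[str]   # low-level game inputs, e.g. "dodge", "light_attack"

def to_aot_text(sample: AoTSample) -> str:
    """Serialize a video-action pair as reasoning followed by an action segment."""
    return f"<think>{sample.thought}</think><act>{' '.join(sample.actions)}</act>"

def truncated_aot_decode(generate: Callable[[str, str], str], prompt: str) -> str:
    """Skip decoding the full reasoning at inference time and jump straight to
    the action segment; this is the rough intuition behind the speed-up."""
    # `generate(prefix, stop)` stands in for any autoregressive decoder call.
    return generate(prompt + "<act>", "</act>")

# Example of the training-side formatting with a stand-in sample:
sample = AoTSample(frames=["frame_000.png"],
                   thought="The boss is winding up a heavy swing.",
                   actions=["dodge", "light_attack"])
print(to_aot_text(sample))
```

The point of the sketch is only the format: reasoning tokens appear in the training targets, but they can be truncated or skipped when the model is queried for an action under a real-time budget.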
Related papers
- NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks [37.03331507197761]
Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios.
These models typically suffer from high computational overhead due to their large sizes.
We propose NORA, a model designed to reduce computational overhead while maintaining strong task performance.
arXiv Detail & Related papers (2025-04-28T14:47:34Z) - PointVLA: Injecting the 3D World into Vision-Language-Action Models [10.758939578236582]
We propose PointVLA, a framework that enhances pre-trained vision-language-action models with point cloud inputs without requiring retraining.
Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block (a hedged sketch of this frozen-expert pattern appears after this list).
PointVLA outperforms state-of-the-art 2D imitation learning methods across both simulated and real-world robotic tasks.
arXiv Detail & Related papers (2025-03-10T16:32:41Z) - 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks.
VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations.
We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z) - CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios.
We present a new advanced VLA architecture derived from Vision-Language Models (VLMs).
We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z) - Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy [68.50785963043161]
GemBench is a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies.
We present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs.
3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
arXiv Detail & Related papers (2024-10-02T09:02:34Z) - Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case [20.14197375326218]
This research aims to provide new insights and directions for applying multimodal agents in complex action game environments.
We select an ARPG, "Black Myth: Wukong", as a research platform to explore the capability boundaries of existing vision language models.
We will release a human operation dataset containing recorded gameplay videos and operation logs, including mouse and keyboard actions.
arXiv Detail & Related papers (2024-09-19T16:30:25Z) - Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z) - Learning a Weakly-Supervised Video Actor-Action Segmentation Model with a Wise Selection [97.98805233539633]
We address weakly-supervised video actor-action segmentation (VAAS).
We propose a general Weakly-Supervised framework with a Wise Selection of training samples and model evaluation criterion (WS2).
WS2 achieves state-of-the-art performance on both weakly-supervised VOS and VAAS tasks.
arXiv Detail & Related papers (2020-03-29T21:15:18Z)
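Several entries above describe keeping a pre-trained backbone or action expert frozen and training only a small injected module (e.g., PointVLA's lightweight 3D-feature block). Below is a hedged PyTorch sketch of that general pattern; the class name, dimensions, and gating scheme are assumptions for illustration and do not reproduce any of the papers' actual architectures.

```python
# Hedged sketch of the frozen-expert + lightweight-injector pattern.
# Names, sizes, and the gating scheme are illustrative assumptions only.
import torch
import torch.nn as nn

class Point3DInjector(nn.Module):
    """Small trainable block that fuses 3D point features into 2D VLA features."""
    def __init__(self, dim_2d: int = 512, dim_3d: int = 256):
        super().__init__()
        self.project = nn.Linear(dim_3d, dim_2d)   # map point features to the 2D space
        self.gate = nn.Parameter(torch.zeros(1))   # start at zero: no 3D influence initially

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
        return feat_2d + self.gate * self.project(feat_3d)

# The pre-trained action expert (stood in for by a single linear layer here)
# stays frozen; only the injector trains, so no retraining of the base model.
action_expert = nn.Linear(512, 7).requires_grad_(False)
injector = Point3DInjector()

fused = injector(torch.randn(1, 512), torch.randn(1, 256))
action = action_expert(fused)
print(action.shape)  # torch.Size([1, 7])
```

Initializing the gate at zero is one common way such adapters preserve the frozen model's behavior at the start of fine-tuning; whether any of the listed papers do exactly this is not stated in the summaries above.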