Learning Representations in Video Game Agents with Supervised Contrastive Imitation Learning
- URL: http://arxiv.org/abs/2509.11880v1
- Date: Mon, 15 Sep 2025 13:00:29 GMT
- Title: Learning Representations in Video Game Agents with Supervised Contrastive Imitation Learning
- Authors: Carlos Celemin, Joseph Brennan, Pierluigi Vito Amadori, Tim Bradley,
- Abstract summary: This paper introduces a novel application of Supervised Contrastive Learning (SupCon) to Imitation Learning (IL)<n>The goal is to obtain latent representations of the observations that capture better the action-relevant factors.<n> Experiments on the 3D games Astro Bot and Returnal, and multiple 2D Atari games show improved representation quality, faster learning convergence, and better generalization.
- Score: 0.6299766708197881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel application of Supervised Contrastive Learning (SupCon) to Imitation Learning (IL), with a focus on learning more effective state representations for agents in video game environments. The goal is to obtain latent representations of the observations that capture better the action-relevant factors, thereby modeling better the cause-effect relationship from the observations that are mapped to the actions performed by the demonstrator, for example, the player jumps whenever an obstacle appears ahead. We propose an approach to integrate the SupCon loss with continuous output spaces, enabling SupCon to operate without constraints regarding the type of actions of the environment. Experiments on the 3D games Astro Bot and Returnal, and multiple 2D Atari games show improved representation quality, faster learning convergence, and better generalization compared to baseline models trained only with supervised action prediction loss functions.
Related papers
- COMBAT: Conditional World Models for Behavioral Agent Training [2.7205188669026277]
We introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3.<n>Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions.<n>We observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs.
arXiv Detail & Related papers (2026-02-28T22:09:04Z) - CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos [73.51386721543135]
We propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories.<n>CLAP maps video transitions onto a quantized, physically executable codebook.<n>We introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation.
arXiv Detail & Related papers (2026-01-07T16:26:33Z) - Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents [56.25101378553328]
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned keyboard-mouse inputs.<n>Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data.<n> Experiments show that Game-TARS achieves about 2 times the success rate over the previous sota model on open-world Minecraft tasks.
arXiv Detail & Related papers (2025-10-27T17:43:51Z) - Playpen: An Environment for Exploring Learning Through Conversational Interaction [81.67330926729015]
We investigate whether Dialogue Games can also serve as a source of feedback signals for learning.<n>We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play.<n>We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills.
arXiv Detail & Related papers (2025-04-11T14:49:33Z) - Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D
Action Representation Learning [33.68311764817763]
We propose Prompted Contrast with Masked Motion Modeling, PCM$rm 3$, for versatile 3D action representation learning.
Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner.
Tests on five downstream tasks under three large-scale datasets are conducted, demonstrating the superior generalization capacity of PCM$rm 3$ compared to the state-of-the-art works.
arXiv Detail & Related papers (2023-08-08T01:27:55Z) - DCT: Dual Channel Training of Action Embeddings for Reinforcement
Learning with Large Discrete Action Spaces [4.168157981135697]
We present a novel framework to efficiently learn action embeddings that simultaneously allow us to reconstruct the original action as well as to predict the expected future state.
We use a trained decoder in conjunction with a standard reinforcement learning algorithm that produces actions in the embedding space.
Empirical results show that the model results in cleaner action embeddings, and the improved representations help learn better policies with earlier convergence.
arXiv Detail & Related papers (2023-06-28T04:32:09Z) - PointACL:Adversarial Contrastive Learning for Robust Point Clouds
Representation under Adversarial Attack [73.3371797787823]
Adversarial contrastive learning (ACL) is considered an effective way to improve the robustness of pre-trained models.
We present our robust aware loss function to train self-supervised contrastive learning framework adversarially.
We validate our method, PointACL on downstream tasks, including 3D classification and 3D segmentation with multiple datasets.
arXiv Detail & Related papers (2022-09-14T22:58:31Z) - Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce representation that emphasizes the novel information in the frame of the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Disentangling Controllable Object through Video Prediction Improves
Visual Reinforcement Learning [82.25034245150582]
In many vision-based reinforcement learning problems, the agent controls a movable object in its visual field.
We propose an end-to-end learning framework to disentangle the controllable object from the observation signal.
The disentangled representation is shown to be useful for RL as additional observation channels to the agent.
arXiv Detail & Related papers (2020-02-21T05:43:34Z) - Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263]
Action anticipation aims to recognize the action with a partial observation.
We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification.
We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.
arXiv Detail & Related papers (2019-06-15T10:30:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.