Related papers: Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)

URL: http://arxiv.org/abs/2312.00592v3
Date: Tue, 2 Jul 2024 09:09:19 GMT
Title: Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)
Authors: Emma Cramer, Jonas Reiher, Sebastian Trimpe
Abstract summary: Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data.
Score: 5.467140383171385
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance.

Related papers

APR-Transformer: Initial Pose Estimation for Localization in Complex Environments through Absolute Pose Regression [3.2584852202495806]
In this paper, we introduce APR-Transformer, a model architecture inspired by state-of-the-art methods. We demonstrate that our proposed method achieves state-of-the-art performance on established benchmark datasets.
arXiv Detail & Related papers (2025-05-14T13:06:42Z)
From Text to Space: Mapping Abstract Spatial Models in LLMs during a Grid-World Navigation Task [0.0]
We investigate the influence of different text-based spatial representations on large language model (LLM) performance and internal activations in a grid-world navigation task. Our experiments reveal that Cartesian representations of space consistently yield higher success rates and path efficiency, with performance scaling markedly with model size. This work advances our understanding of how LLMs process spatial information and provides valuable insights for developing more interpretable and robust agentic AI systems.
arXiv Detail & Related papers (2025-02-23T19:09:01Z)
DistFormer: Enhancing Local and Global Features for Monocular Per-Object Distance Estimation [35.6022448037063]
Per-object distance estimation is crucial in safety-critical applications such as autonomous driving, surveillance, and robotics. Existing approaches rely on two scales: local information (i.e., the bounding box proportions) or global information. Our work aims to strengthen both local and global cues.
arXiv Detail & Related papers (2024-01-06T10:56:36Z)
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually. We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions. We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL [90.06845886194235]
We propose a modified objective for model-based reinforcement learning (RL). We integrate a term inspired by variational empowerment into a state-space model based on mutual information. We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds.
arXiv Detail & Related papers (2022-04-18T23:09:23Z)
Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges. We propose an approach that explores the environment in search of target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible. Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
arXiv Detail & Related papers (2022-03-15T17:59:01Z)
Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, 3D, simulated environments such as AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents. We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
arXiv Detail & Related papers (2020-10-28T19:27:26Z)
Learning Invariant Representations for Reinforcement Learning without Reconstruction [98.33235415273562]
We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks.
arXiv Detail & Related papers (2020-06-18T17:59:35Z)
Deflating Dataset Bias Using Synthetic Data Augmentation [8.509201763744246]
State-of-the-art methods for most vision tasks for Autonomous Vehicles (AVs) rely on supervised learning. The goal of this paper is to investigate the use of targeted synthetic data augmentation for filling gaps in real datasets for vision tasks. Empirical studies on three different computer vision tasks of practical use to AVs consistently show that having synthetic data in the training mix provides a significant boost in cross-dataset generalization performance.
arXiv Detail & Related papers (2020-04-28T21:56:10Z)
Acceleration of Actor-Critic Deep Reinforcement Learning for Visual Grasping in Clutter by State Representation Learning Based on Disentanglement of a Raw Input Image [4.970364068620608]
Actor-critic deep reinforcement learning (RL) methods typically perform very poorly when grasping diverse objects. We employ state representation learning (SRL), where we encode essential information first for subsequent use in RL. We found that preprocessing based on the disentanglement of a raw input image is the key to effectively capturing a compact representation.
arXiv Detail & Related papers (2020-02-27T03:58:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.