Towards Active Vision for Action Localization with Reactive Control and
Predictive Learning
- URL: http://arxiv.org/abs/2111.05448v1
- Date: Tue, 9 Nov 2021 23:16:55 GMT
- Title: Towards Active Vision for Action Localization with Reactive Control and
Predictive Learning
- Authors: Shubham Trehan, Sathyanarayanan N. Aakur
- Abstract summary: We formulate an energy-based mechanism that combines predictive learning and reactive control to perform active action localization without rewards.
We demonstrate that the proposed approach can generalize to different tasks and environments in a streaming fashion, without explicit rewards or training.
- Score: 8.22379888383833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual event perception tasks such as action localization have primarily
focused on supervised learning settings under a static observer, i.e., the
camera is static and cannot be controlled by an algorithm. They are often
restricted by the quality, quantity, and diversity of annotated training data
and often do not generalize to out-of-domain samples. In this
work, we tackle the problem of active action localization where the goal is to
localize an action while controlling the geometric and physical parameters of
an active camera to keep the action in the field of view without training data.
We formulate an energy-based mechanism that combines predictive learning and
reactive control to perform active action localization without rewards, which
can be sparse or non-existent in real-world environments. We perform extensive
experiments in both simulated and real-world environments on two tasks - active
object tracking and active action localization. We demonstrate that the
proposed approach can generalize to different tasks and environments in a
streaming fashion, without explicit rewards or training. We show that the
proposed approach outperforms unsupervised baselines and obtains competitive
performance compared to those trained with reinforcement learning.
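The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: a predictive model is adapted online in a streaming fashion, its per-region prediction error serves as an energy signal, and a simple reactive controller steers the camera toward the high-energy region so the action stays in the field of view. All names (PredictiveModel, energy_map, reactive_control) and the proportional-control rule are hypothetical placeholders, not the authors' architecture.

```python
# Illustrative sketch only (not the paper's implementation): an energy-based
# loop coupling online predictive learning with reactive camera control.
import torch
import torch.nn as nn

class PredictiveModel(nn.Module):
    """Small convolutional predictor of the next frame from the current one."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)

def energy_map(predicted, observed):
    # Per-pixel squared prediction error, averaged over channels; regions of
    # hard-to-predict motion (e.g. an ongoing action) accumulate high energy.
    return ((predicted - observed) ** 2).mean(dim=1)  # (B, H, W)

def reactive_control(energy, gain=0.1):
    # Hypothetical proportional controller: steer the camera toward the
    # energy centroid so the high-error (action) region stays centred.
    b, h, w = energy.shape
    ys = torch.linspace(-1.0, 1.0, h).view(1, h, 1)
    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, w)
    weights = energy / (energy.sum(dim=(1, 2), keepdim=True) + 1e-8)
    pan = (weights * xs).sum(dim=(1, 2))   # horizontal offset of energy centroid
    tilt = (weights * ys).sum(dim=(1, 2))  # vertical offset of energy centroid
    return gain * pan, gain * tilt         # commands sent to the active camera

# Streaming loop: predict, measure energy, command the camera, then adapt the
# predictor online on the newest frame (no rewards, no offline training).
model = PredictiveModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
prev_frame = torch.rand(1, 3, 64, 64)      # stand-in for a camera frame
for _ in range(5):
    frame = torch.rand(1, 3, 64, 64)       # next observation from the camera
    pred = model(prev_frame)
    energy = energy_map(pred, frame)
    pan_cmd, tilt_cmd = reactive_control(energy.detach())
    loss = energy.mean()                   # predictive-learning objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    prev_frame = frame
```

Because the only learning signal in this sketch is the streaming prediction error, the loop needs neither rewards nor offline training data, which mirrors the setting described in the abstract.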
Related papers
- Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information [68.10033984296247]
This paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy.
Our contributions involve using a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications.
arXiv Detail & Related papers (2024-07-22T12:32:09Z) - Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based
Agile Flight [21.728935597793473]
This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment.
We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight.
arXiv Detail & Related papers (2023-09-18T15:25:59Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Predictive Experience Replay for Continual Visual Control and
Forecasting [62.06183102362871]
We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - Task-Induced Representation Learning [14.095897879222672]
We evaluate the effectiveness of representation learning approaches for decision making in visually complex environments.
We find that representation learning generally improves sample efficiency on unseen tasks even in visually complex scenes.
arXiv Detail & Related papers (2022-04-25T17:57:10Z) - Trajectory-based Reinforcement Learning of Non-prehensile Manipulation
Skills for Semi-Autonomous Teleoperation [18.782289957834475]
We present a semi-autonomous teleoperation framework for a pick-and-place task using an RGB-D sensor.
Trajectory-based reinforcement learning is used to learn non-prehensile manipulation for rearranging the objects.
We show that the proposed method outperforms manual keyboard control in terms of the time required for grasping.
arXiv Detail & Related papers (2021-09-27T14:27:28Z) - Learning Actor-centered Representations for Action Localization in
Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
Weakly supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Active Visual Localization in Partially Calibrated Environments [35.48595012305253]
Humans can robustly localize themselves without a map after they get lost following prominent visual cues or landmarks.
In this work, we aim to endow autonomous agents with the same ability. This ability is important in robotics applications, yet it is very challenging when an agent is exposed to partially calibrated environments.
We propose an indoor scene dataset ACR-6, which consists of both synthetic and real data and simulates challenging scenarios for active visual localization.
arXiv Detail & Related papers (2020-12-08T08:00:55Z) - Unsupervised Domain Adaptation for Spatio-Temporal Action Localization [69.12982544509427]
Spatio-temporal action localization is an important problem in computer vision.
We propose an end-to-end unsupervised domain adaptation algorithm.
We show that significant performance gain can be achieved when spatial and temporal features are adapted separately or jointly.
arXiv Detail & Related papers (2020-10-19T04:25:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.