Learning Higher-order Object Interactions for Keypoint-based Video Understanding
- URL: http://arxiv.org/abs/2305.09539v1
- Date: Tue, 16 May 2023 15:30:33 GMT
- Title: Learning Higher-order Object Interactions for Keypoint-based Video Understanding
- Authors: Yi Huang, Asim Kadav, Farley Lai, Deep Patel, Hans Peter Graf
- Abstract summary: We describe an action-localization method, KeyNet, that uses only keypoint data for tracking and action recognition.
We find that KeyNet is able to track and classify human actions at just 5 FPS.
- Score: 15.52736059969859
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Action recognition is an important problem that requires identifying
actions in video by learning complex interactions across scene actors and
objects. However, modern deep-learning-based networks often require significant
computation and may capture scene context using various modalities, which
further increases compute costs. Efficient methods, such as those used for
AR/VR, often rely only on human-keypoint information but suffer from a loss of
scene context that hurts accuracy. In this paper, we describe an
action-localization method, KeyNet, that uses only keypoint data for tracking
and action recognition. Specifically, KeyNet introduces the use of object-based
keypoint information to capture context in the scene. Our method illustrates
how to build a structured intermediate representation that allows modeling
higher-order interactions in the scene from object and human keypoints, without
using any RGB information. We find that KeyNet is able to track and classify
human actions at just 5 FPS. More importantly, we demonstrate that object
keypoints can be modeled to recover the context lost by relying on keypoint
information alone, on the AVA action and Kinetics datasets.
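As a concrete illustration of the structured intermediate representation described in the abstract, here is a minimal, hypothetical PyTorch sketch that tokenizes human and object keypoints (coordinates plus a learned semantic-type embedding) and models their higher-order interactions with a small transformer encoder. The class name, tokenization scheme, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not KeyNet's code): keypoint tokens -> transformer
# encoder -> action logits, with no RGB input. All sizes are assumptions.
import torch
import torch.nn as nn

class KeypointInteractionModel(nn.Module):
    def __init__(self, num_keypoint_types, num_actions, dim=128, heads=4, layers=2):
        super().__init__()
        # Each keypoint token carries (x, y, confidence) plus a learned
        # embedding for its semantic type (e.g., "left wrist", "cup handle").
        self.coord_proj = nn.Linear(3, dim)
        self.type_embed = nn.Embedding(num_keypoint_types, dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Self-attention over all tokens lets the model capture higher-order
        # human-object interactions directly from keypoints.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.classifier = nn.Linear(dim, num_actions)

    def forward(self, coords, types, padding_mask=None):
        # coords: (B, N, 3) normalized x, y and detector confidence
        # types:  (B, N)    integer keypoint-type ids
        tokens = self.coord_proj(coords) + self.type_embed(types)
        ctx = self.encoder(tokens, src_key_padding_mask=padding_mask)
        return self.classifier(ctx.mean(dim=1))  # pooled action logits

model = KeypointInteractionModel(num_keypoint_types=40, num_actions=80)
coords = torch.rand(2, 25, 3)                  # 25 keypoints per clip
types = torch.randint(0, 40, (2, 25))
logits = model(coords, types)                  # (2, 80)
```

The mean pool here is the simplest choice; a full action-localization system would instead pool per tracked actor to produce per-person predictions.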
Related papers
- Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI.
We introduce an expertly curated dataset in the Universal Scene Description (USD) format featuring high-quality manual annotations.
With its broad and high-quality annotations, the dataset provides the basis for holistic 3D scene understanding models.
arXiv Detail & Related papers (2024-12-02T11:33:55Z)
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct representation for capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z)
- Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and convolutional networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
- End-to-End Learning of Keypoint Representations for Continuous Control from Images [84.8536730437934]
We show that it is possible to learn efficient keypoint representations end-to-end, without the need for unsupervised pre-training, decoders, or additional losses.
Our proposed architecture consists of a differentiable keypoint extractor that feeds the coordinates directly to a soft actor-critic agent (a minimal sketch of such an extractor appears after this list).
arXiv Detail & Related papers (2021-06-15T09:17:06Z)
- Where2Act: From Pixels to Actions for Articulated 3D Objects [54.19638599501286]
We extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts.
We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation.
Our learned models even transfer to real-world data.
arXiv Detail & Related papers (2021-01-07T18:56:38Z)
- Unsupervised Object Keypoint Learning using Local Spatial Predictability [10.862430265350804]
We propose PermaKey, a novel approach to representation learning based on object keypoints.
We demonstrate the efficacy of PermaKey on Atari where it learns keypoints corresponding to the most salient object parts and is robust to certain visual distractors.
arXiv Detail & Related papers (2020-11-25T18:27:05Z)
- S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high-level scripting of human-understandable behaviours.
arXiv Detail & Related papers (2020-09-30T14:44:54Z)
- CoKe: Localized Contrastive Learning for Robust Keypoint Detection [24.167397429511915]
We show that keypoint kernels can be chosen to optimize three types of distances in the feature space.
We formulate this optimization process within a single framework that includes supervised contrastive learning.
CoKe achieves state-of-the-art results compared to approaches that jointly represent all keypoints holistically.
arXiv Detail & Related papers (2020-09-29T16:00:43Z)
- A Deep Learning Approach to Object Affordance Segmentation [31.221897360610114]
We design an autoencoder that infers pixel-wise affordance labels in both videos and static images.
Our model eliminates the need for object labels and bounding boxes by using a soft-attention mechanism.
We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF.
arXiv Detail & Related papers (2020-04-18T15:34:41Z)
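The entry above on "End-to-End Learning of Keypoint Representations for Continuous Control from Images" mentions a differentiable keypoint extractor whose coordinates feed a soft actor-critic agent. The sketch below shows the standard spatial soft-argmax construction commonly used for such extractors; the backbone shape and all sizes are illustrative assumptions, not that paper's exact architecture.

```python
# Hypothetical sketch (not the paper's code): a conv net predicts K heatmaps
# and a spatial soft-argmax turns each into (x, y) coordinates, so gradients
# from a downstream policy can flow back into the extractor end-to-end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftArgmaxKeypointExtractor(nn.Module):
    def __init__(self, in_channels=3, num_keypoints=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_keypoints, 1),  # one heatmap per keypoint
        )

    def forward(self, images):
        heatmaps = self.backbone(images)               # (B, K, H, W)
        b, k, h, w = heatmaps.shape
        probs = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
        # Expected coordinates under each keypoint's spatial distribution,
        # normalized to [-1, 1]; this is the differentiable "soft argmax".
        ys = torch.linspace(-1.0, 1.0, h, device=images.device)
        xs = torch.linspace(-1.0, 1.0, w, device=images.device)
        y = (probs.sum(dim=3) * ys).sum(dim=2)         # (B, K)
        x = (probs.sum(dim=2) * xs).sum(dim=2)         # (B, K)
        return torch.stack([x, y], dim=-1)             # (B, K, 2)

extractor = SoftArgmaxKeypointExtractor()
coords = extractor(torch.rand(4, 3, 64, 64))           # (4, 8, 2)
```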