Adding Knowledge to Unsupervised Algorithms for the Recognition of
Intent
- URL: http://arxiv.org/abs/2011.06219v1
- Date: Thu, 12 Nov 2020 05:57:09 GMT
- Title: Adding Knowledge to Unsupervised Algorithms for the Recognition of
Intent
- Authors: Stuart Synakowski, Qianli Feng, Aleix Martinez
- Abstract summary: We derive an algorithm that can infer whether the behavior of an agent in a scene is intentional or unintentional based on its 3D kinematics.
We show how the addition of this basic knowledge leads to a simple, unsupervised algorithm.
Experiments on three dedicated datasets show that our algorithm can recognize whether an action is intentional or not, even without training data.
- Score: 3.0079490585515343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of computer vision algorithms is near or superior to that of humans in
visual problems including object recognition (especially those of fine-grained
categories), segmentation, and 3D object reconstruction from 2D views. Humans
are, however, capable of higher-level image analyses. A clear example,
involving theory of mind, is our ability to determine whether a perceived
behavior or action was performed intentionally or not. In this paper, we derive
an algorithm that can infer whether the behavior of an agent in a scene is
intentional or unintentional based on its 3D kinematics, using the knowledge of
self-propelled motion, Newtonian motion and their relationship. We show how the
addition of this basic knowledge leads to a simple, unsupervised algorithm. To
test the derived algorithm, we constructed three dedicated datasets from
abstract geometric animation to realistic videos of agents performing
intentional and non-intentional actions. Experiments on these datasets show
that our algorithm can recognize whether an action is intentional or not, even
without training data. The performance is comparable to various supervised
baselines quantitatively, with sensible intentionality segmentation
qualitatively.
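The abstract's central idea, separating self-propelled motion from Newtonian motion using 3D kinematics, can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes that gravity is the only external force (contact forces are ignored), that positions are sampled at a fixed interval, and that a fixed residual threshold is adequate. The names `accelerations`, `self_propelled_mask`, `G`, and `threshold` are hypothetical.

```python
import math

# Assumed gravity vector (z-up, m/s^2); an illustration choice, not from the paper.
G = (0.0, 0.0, -9.81)


def accelerations(positions, dt):
    """Estimate acceleration with central second differences.

    positions: list of (x, y, z) samples taken every dt seconds.
    Returns one acceleration tuple per interior sample.
    """
    acc = []
    for i in range(1, len(positions) - 1):
        prev, cur, nxt = positions[i - 1], positions[i], positions[i + 1]
        acc.append(tuple((n - 2.0 * c + p) / (dt * dt)
                         for p, c, n in zip(prev, cur, nxt)))
    return acc


def self_propelled_mask(positions, dt, threshold=1.0):
    """Flag samples whose acceleration is not explained by gravity alone.

    A large residual means some force other than gravity acted on the body,
    which this sketch treats as a cue for self-propelled motion.
    """
    mask = []
    for a in accelerations(positions, dt):
        residual = math.sqrt(sum((ai - gi) ** 2 for ai, gi in zip(a, G)))
        mask.append(residual > threshold)
    return mask


dt = 0.01
# A body in free fall: purely Newtonian motion under gravity.
fall = [(0.0, 0.0, 10.0 - 0.5 * 9.81 * (k * dt) ** 2) for k in range(50)]
# A body holding its height in mid-air: it must be counteracting gravity.
hover = [(0.0, 0.0, 1.0)] * 50

print(any(self_propelled_mask(fall, dt)))
print(all(self_propelled_mask(hover, dt)))
```

In this toy setting the falling body produces no flagged samples, while the hovering body is flagged throughout. A realistic system would also have to model contact forces and noise in the estimated kinematics.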
Related papers
- Human-level 3D shape perception emerges from multi-view learning [63.048728487674815]
We develop a modeling framework that predicts human 3D shape inferences for arbitrary objects.
We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data.
We find that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
arXiv Detail & Related papers (2026-02-19T18:56:05Z) - Object-centric 3D Motion Field for Robot Learning from Human Videos [56.9436352861611]
We propose to use object-centric 3D motion field to represent actions for robot learning from human videos.
We present a novel framework for extracting this representation from videos for zero-shot control.
Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method.
arXiv Detail & Related papers (2025-06-04T17:59:06Z) - Offline Imitation Learning Through Graph Search and Retrieval [57.57306578140857]
Imitation learning is a powerful machine learning algorithm for a robot to acquire manipulation skills.
We propose GSR, a simple yet effective algorithm that learns from suboptimal demonstrations through Graph Search and Retrieval.
GSR can achieve a 10% to 30% higher success rate and over 30% higher proficiency compared to baselines.
arXiv Detail & Related papers (2024-07-22T06:12:21Z) - Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention [86.39271731460927]
3D intention grounding is a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back".
We introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset.
We also propose IntentNet, our unique approach, designed to tackle this intention-based detection problem.
arXiv Detail & Related papers (2024-05-28T15:48:39Z) - Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention [10.149523817328921]
We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input.
Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence to fulfill this intention.
Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy for 18-class intention recognition.
arXiv Detail & Related papers (2024-04-10T21:03:23Z) - Explaining Deep Face Algorithms through Visualization: A Survey [57.60696799018538]
This work undertakes a first-of-its-kind meta-analysis of explainability algorithms in the face domain.
We review existing face explainability works and reveal valuable insights into the structure and hierarchy of face networks.
arXiv Detail & Related papers (2023-09-26T07:16:39Z) - ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z) - Modeling human intention inference in continuous 3D domains by inverse
planning and body kinematics [31.421686048250827]
We describe a computational framework for evaluating models of goal inference in the domain of 3D motor actions.
We evaluate our framework in three behavioural experiments using a novel Target Reaching Task, in which human observers infer intentions of actors reaching for targets among distractors.
We show that human observers indeed rely on inverse body kinematics in such scenarios, suggesting that modeling body kinematics can improve performance of inference algorithms.
arXiv Detail & Related papers (2021-12-02T00:55:58Z) - Spot What Matters: Learning Context Using Graph Convolutional Networks
for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z) - Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via
Implicit Representations [20.155920256334706]
We show that 3D reconstruction and grasp learning are two intimately connected tasks.
We propose to utilize the synergies between grasp affordance and 3D reconstruction through multi-task learning of a shared representation.
Our method outperforms baselines by over 10% in terms of grasp success rate.
arXiv Detail & Related papers (2021-04-04T05:46:37Z) - Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z) - "What's This?" -- Learning to Segment Unknown Objects from Manipulation
Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator.
We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge.
Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data.
arXiv Detail & Related papers (2020-11-06T10:55:28Z) - Memory-augmented Dense Predictive Coding for Video Representation
Learning [103.69904379356413]
We propose a new architecture and learning framework Memory-augmented Predictive Coding (MemDPC) for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude fewer training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z) - Unsupervised 3D Human Pose Representation with Viewpoint and Pose
Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.