Discovering a Variety of Objects in Spatio-Temporal Human-Object
Interactions
- URL: http://arxiv.org/abs/2211.07501v1
- Date: Mon, 14 Nov 2022 16:33:54 GMT
- Title: Discovering a Variety of Objects in Spatio-Temporal Human-Object
Interactions
- Authors: Yong-Lu Li, Hongwei Fan, Zuoyu Qiu, Yiming Dou, Liang Xu, Hao-Shu
Fang, Peiyang Guo, Haisheng Su, Dongliang Wang, Wei Wu, Cewu Lu
- Abstract summary: In daily HOIs, humans often interact with a variety of objects, e.g., holding and touching dozens of household items in cleaning.
Here, we introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO), including 51 interactions and 1,000+ objects.
An ST-HOI learning task is proposed, expecting vision systems to track human actors, detect interactions, and simultaneously discover interacted objects.
- Score: 45.92485321148352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal Human-Object Interaction (ST-HOI) detection aims at detecting
HOIs from videos, which is crucial for activity understanding. In daily HOIs,
humans often interact with a variety of objects, e.g., holding and touching
dozens of household items in cleaning. However, existing whole body-object
interaction video benchmarks usually provide limited object classes. Here, we
introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO)
including 51 interactions and 1,000+ objects. Accordingly, an ST-HOI learning
task is proposed expecting vision systems to track human actors, detect
interactions and simultaneously discover interacted objects. Even though
today's detectors/trackers excel at object detection/tracking tasks, they
perform unsatisfactorily when localizing diverse/unseen objects in DIO. This
clearly reveals the limitations of current vision systems and poses a great challenge.
Thus, we explore how to leverage spatio-temporal cues for object discovery and
devise a Hierarchical Probe Network (HPN) that discovers interacted objects
using hierarchical spatio-temporal human/context cues.
In extensive experiments, HPN demonstrates impressive performance. Data and
code are available at https://github.com/DirtyHarryLYL/HAKE-AVA.
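To make the proposed ST-HOI task concrete, here is a minimal sketch, in plain Python, of what one prediction could contain: a tracked human actor, the detected interaction classes, and the discovered (class-agnostic) object boxes per frame. The container names are hypothetical and are not the HAKE-AVA data format.

```python
from dataclasses import dataclass, field

# Hypothetical containers for one ST-HOI prediction; the real DIO/HAKE-AVA
# format may differ. This only illustrates the three sub-tasks: tracking a
# human actor, detecting interactions, and discovering interacted objects.

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class STHOIPrediction:
    actor_track: dict[int, Box] = field(default_factory=dict)         # frame id -> actor box (tracking)
    interactions: list[str] = field(default_factory=list)             # e.g. ["hold", "touch"] (detection)
    object_tracks: dict[int, list[Box]] = field(default_factory=dict) # frame id -> discovered object boxes

pred = STHOIPrediction()
pred.actor_track[0] = Box(10, 20, 110, 220)
pred.interactions.append("hold")
pred.object_tracks[0] = [Box(60, 120, 95, 160)]  # an interacted object, class-agnostic
print(pred)
```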
Related papers
- AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563] (2024-01-12)
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of this knowledge is hidden and lies beyond the image content, given only supervised labels from a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of rich world, abstract, and human-object-interaction knowledge.
- InterTracker: Discovering and Tracking General Objects Interacting with Hands in the Wild [40.489171608114574] (2023-08-06)
Existing methods rely on frame-based detectors to locate interacting objects.
We propose to leverage hand-object interaction to track interactive objects.
Our proposed method outperforms the state-of-the-art methods.
- Object-agnostic Affordance Categorization via Unsupervised Learning of Graph Embeddings [6.371828910727037] (2023-03-30)
Acquiring knowledge about object interactions and affordances can facilitate scene understanding and human-robot collaboration tasks.
We address the problem of affordance categorization for class-agnostic objects with an open set of interactions.
A novel depth-informed qualitative spatial representation is proposed for the construction of Activity Graphs.
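The summary above only names the representation, so here is a rough, hypothetical illustration of the Activity Graph idea: object tracks become nodes, and a crude depth-informed qualitative spatial relation between each pair of objects is recorded as a timestamped edge. None of the function names or thresholds come from the paper.

```python
import itertools

# Hypothetical Activity Graph sketch: nodes are object tracks, edges record a
# qualitative spatial relation between two objects at a given frame. The
# depth-informed relation below is a crude stand-in for the paper's.

def qualitative_relation(box_a, box_b, depth_a, depth_b, touch_px=10.0):
    """Classify the spatial relation between two boxes using depth cues."""
    gap = max(box_b[0] - box_a[2], box_a[0] - box_b[2])  # horizontal gap
    if abs(depth_a - depth_b) > 0.5:
        return "behind" if depth_a > depth_b else "in_front"
    return "touching" if gap < touch_px else "apart"

def build_activity_graph(frames):
    """frames: list of dicts mapping object id -> (box, depth)."""
    edges = []
    for t, objs in enumerate(frames):
        for (a, (ba, da)), (b, (bb, db)) in itertools.combinations(objs.items(), 2):
            edges.append((t, a, b, qualitative_relation(ba, bb, da, db)))
    return edges

frames = [{"cup": ((0, 0, 10, 10), 1.0), "hand": ((8, 0, 18, 10), 1.1)}]
print(build_activity_graph(frames))  # [(0, 'cup', 'hand', 'touching')]
```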
- Learn to Predict How Humans Manipulate Large-sized Objects from Interactive Motions [82.90906153293585] (2022-06-25)
We propose a graph neural network, HO-GCN, to fuse motion data and dynamic descriptors for the prediction task.
We show that the proposed network, which consumes dynamic descriptors, achieves state-of-the-art prediction results and generalizes better to unseen objects.
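As a rough sketch of the fusion idea named above (not the authors' HO-GCN architecture), the snippet below concatenates per-node motion features with dynamic descriptors and propagates them over a human-object graph with a single message-passing step. All dimensions are made up.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of fusing motion features with dynamic descriptors and
# propagating them over a human-object graph (one GCN-style step).

class TinyHOGCN(nn.Module):
    def __init__(self, motion_dim, dyn_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(motion_dim + dyn_dim, hidden_dim)

    def forward(self, motion, dynamics, adj):
        # motion: (N, motion_dim), dynamics: (N, dyn_dim), adj: (N, N) normalized
        x = self.proj(torch.cat([motion, dynamics], dim=-1))  # fuse the two cues
        return torch.relu(adj @ x)  # one step of graph message passing

nodes, adj = 4, torch.eye(4)  # trivial identity adjacency for brevity
model = TinyHOGCN(motion_dim=6, dyn_dim=3, hidden_dim=16)
out = model(torch.randn(nodes, 6), torch.randn(nodes, 3), adj)
print(out.shape)  # torch.Size([4, 16])
```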
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515] (2021-08-25)
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
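To illustrate how such a decision process could tie the modules together, here is a heavily simplified, hypothetical control loop: stubbed perception modules produce a belief over candidate targets, and a POMDP-like rule either grasps when confident or asks a clarifying question to gather information. Every function here is a stand-in, not the INVIGORATE code.

```python
import random

# Hypothetical sketch of an INVIGORATE-style control flow: a belief over
# which object is the referred target, updated by (stubbed) perception
# modules, with a rule that asks a question when the belief is too uncertain.

def detect_objects(image):             # stand-in for the detection network
    return ["obj_a", "obj_b", "obj_c"]

def ground_expression(expr, objects):  # stand-in for visual grounding
    scores = [random.random() for _ in objects]
    s = sum(scores)
    return {o: v / s for o, v in zip(objects, scores)}  # belief over targets

def generate_question(belief):         # stand-in for question generation
    top = max(belief, key=belief.get)
    return f"Do you mean {top}?"

def interact(image, expr, confidence=0.8):
    objects = detect_objects(image)
    belief = ground_expression(expr, objects)
    target = max(belief, key=belief.get)
    if belief[target] >= confidence:
        return f"grasp({target})"      # confident enough: act
    return generate_question(belief)   # uncertain: gather information first

print(interact(image=None, expr="the red mug"))
```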
- Detecting Human-Object Interaction via Fabricated Compositional Learning [106.37536031160282] (2021-03-15)
Human-Object Interaction (HOI) detection is a fundamental task for high-level scene understanding.
Humans have a powerful compositional perception ability that lets them recognize rare or unseen HOI samples.
We propose Fabricated Compositional Learning (FCL) to address the problem of open long-tailed HOI detection.
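The summary does not spell out the mechanism; as one hedged reading of "fabricated compositional learning", the sketch below fabricates object features from noise plus a class embedding and composes them with real verb features to synthesize samples for rare HOI categories. Names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

# Rough, hypothetical sketch of the compositional idea: fabricate object
# features (here from noise and an object-class embedding) and combine them
# with real verb features to synthesize rare or unseen HOI training samples.

class ObjectFabricator(nn.Module):
    def __init__(self, num_classes, embed_dim, feat_dim):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        self.generator = nn.Linear(embed_dim + feat_dim, feat_dim)

    def forward(self, obj_class, noise):
        z = torch.cat([self.class_embed(obj_class), noise], dim=-1)
        return self.generator(z)  # fabricated object feature for this class

fab = ObjectFabricator(num_classes=80, embed_dim=32, feat_dim=128)
verb_feat = torch.randn(1, 128)                        # real verb feature
fake_obj = fab(torch.tensor([7]), torch.randn(1, 128))
hoi_sample = torch.cat([verb_feat, fake_obj], dim=-1)  # composed rare HOI
print(hoi_sample.shape)  # torch.Size([1, 256])
```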
arXiv Detail & Related papers (2021-03-15T08:52:56Z) - Human-Object Interaction Detection:A Quick Survey and Examination of
Methods [17.8805983491991]
This is the first general survey of the state-of-the-art and milestone works in this field.
We provide a basic survey of the developments in the field of human-object interaction detection.
We examine the HORCNN architecture as it is a foundational work in the field.
arXiv Detail & Related papers (2020-09-27T20:58:39Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> triplets in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space, and obtains detection results by measuring their similarities.
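A minimal sketch of the joint-embedding scoring described above, with assumed feature dimensions (e.g., 1024-d visual features, 300-d word embeddings) rather than ConsNet's actual ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical joint-embedding scorer: project visual features of a
# human-object pair and word embeddings of HOI labels into a shared space,
# then score each label by cosine similarity (not the ConsNet implementation).

class JointEmbedScorer(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=300, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, pair_feat, label_embeds):
        v = F.normalize(self.vis_proj(pair_feat), dim=-1)     # (B, D)
        t = F.normalize(self.txt_proj(label_embeds), dim=-1)  # (L, D)
        return v @ t.T  # (B, L) cosine similarities, one score per HOI label

scorer = JointEmbedScorer()
scores = scorer(torch.randn(2, 1024), torch.randn(600, 300))
print(scores.shape)  # torch.Size([2, 600]) -- e.g. the 600 HICO-DET HOI classes
```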
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552] (2020-03-31)
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
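As a hedged illustration of the interaction-point idea (not the paper's implementation), a fully-convolutional head can emit one heatmap per action class, so a peak both localizes and classifies an interaction; the 29 actions below are only an example count:

```python
import torch
import torch.nn as nn

# Hypothetical fully-convolutional interaction-point head: from a backbone
# feature map, predict one heatmap per action class; a peak at (x, y) both
# localizes and classifies the interaction.

class InteractionPointHead(nn.Module):
    def __init__(self, in_channels=256, num_actions=29):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_actions, kernel_size=1),
        )

    def forward(self, feat):
        return self.head(feat).sigmoid()  # (B, num_actions, H, W) heatmaps

head = InteractionPointHead()
heat = head(torch.randn(1, 256, 64, 64))
flat = heat.flatten(2).argmax(-1)  # flattened peak index per action map
print(heat.shape, flat.shape)  # torch.Size([1, 29, 64, 64]) torch.Size([1, 29])
```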