Weakly Supervised Human-Object Interaction Detection in Video via
Contrastive Spatiotemporal Regions
- URL: http://arxiv.org/abs/2110.03562v1
- Date: Thu, 7 Oct 2021 15:30:18 GMT
- Title: Weakly Supervised Human-Object Interaction Detection in Video via
Contrastive Spatiotemporal Regions
- Authors: Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, and Bryan Russell
- Abstract summary: A system does not know what human-object interactions are present in a video or the actual spatiotemporal location of the human and the object.
We introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.
- Score: 81.88294320397826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the task of weakly supervised learning for detecting human and
object interactions in videos. Our task poses unique challenges as a system
does not know what types of human-object interactions are present in a video or
the actual spatiotemporal location of the human and the object. To address
these challenges, we introduce a contrastive weakly supervised training loss
that aims to jointly associate spatiotemporal regions in a video with an action
and object vocabulary and encourage temporal continuity of the visual
appearance of moving objects as a form of self-supervision. To train our model,
we introduce a dataset comprising over 6.5k videos with human-object
interaction annotations that have been semi-automatically curated from sentence
captions associated with the videos. We demonstrate improved performance over
weakly supervised baselines adapted to our task on our video dataset.
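For intuition only, below is a minimal, hypothetical sketch (in PyTorch) of the two loss terms the abstract describes: a contrastive term that associates candidate spatiotemporal region features with the caption's action/object word embeddings under weak supervision, and a temporal-continuity term on tracked object features. All function names, tensor shapes, the InfoNCE-style relaxation, and the loss weighting are assumptions for illustration, not the paper's exact formulation.
```python
# Hypothetical sketch of the two loss terms described in the abstract:
# (1) a contrastive loss scoring how well spatiotemporal region features match
#     embeddings of the action/object words from the video's caption, and
# (2) a temporal-continuity term pulling together features of the same object
#     region in adjacent frames. Shapes and names are illustrative only.
import torch
import torch.nn.functional as F

def contrastive_region_word_loss(region_feats, word_embeds, temperature=0.1):
    """region_feats: (R, D) features of candidate human/object regions in a video.
    word_embeds:  (V, D) embeddings of the action/object vocabulary terms that the
    caption says are present (positives); words from other videos would serve as
    negatives in a full implementation (omitted here)."""
    region_feats = F.normalize(region_feats, dim=-1)
    word_embeds = F.normalize(word_embeds, dim=-1)
    sims = word_embeds @ region_feats.t() / temperature       # (V, R)
    # Weak supervision: the matching region for each word is unknown, so pool
    # over regions with a soft max (a common multiple-instance-learning relaxation)
    # and encourage high scores for the caption's words.
    video_scores = torch.logsumexp(sims, dim=1)               # (V,)
    return -video_scores.mean()

def temporal_continuity_loss(track_feats):
    """track_feats: (T, D) features of one tracked object region over T frames.
    Encourages the visual appearance of a moving object to change smoothly."""
    track_feats = F.normalize(track_feats, dim=-1)
    return (1.0 - (track_feats[:-1] * track_feats[1:]).sum(dim=-1)).mean()

# Example usage with random features (D = 256):
regions = torch.randn(20, 256)   # candidate regions from one video
words = torch.randn(5, 256)      # caption's action/object word embeddings
track = torch.randn(8, 256)      # one object track across 8 frames
loss = contrastive_region_word_loss(regions, words) + 0.5 * temporal_continuity_loss(track)
```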
Related papers
- AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation [40.81246588724407]
We introduce AnchorCrafter, a novel system designed to generate 2D videos featuring a target human and a customized object.
We propose two key innovations: the HOI-appearance perception, which enhances object appearance recognition from arbitrary multi-view perspectives, and the HOI-motion injection, which enables complex human-object interactions.
Our proposed system outperforms existing methods in preserving object appearance and shape awareness, while simultaneously maintaining consistency in human appearance and motion.
arXiv Detail & Related papers (2024-11-26T12:42:13Z) - I-MPN: Inductive Message Passing Network for Efficient Human-in-the-Loop Annotation of Mobile Eye Tracking Data [4.487146086221174]
We present a novel human-centered learning algorithm designed for automated object recognition within mobile eye-tracking settings.
Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations.
arXiv Detail & Related papers (2024-06-10T13:08:31Z) - Tragedy Plus Time: Capturing Unintended Human Activities from
Weakly-labeled Videos [31.1632730473261]
W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations.
We propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in the video.
arXiv Detail & Related papers (2022-04-28T14:56:43Z) - SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric
Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Spatio-Temporal Interaction Graph Parsing Networks for Human-Object
Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z) - Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling of human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z) - Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.