Related papers: Motion Guided Attention Fusion to Recognize Interactions from Videos

Motion Guided Attention Fusion to Recognize Interactions from Videos

URL: http://arxiv.org/abs/2104.00646v1
Date: Thu, 1 Apr 2021 17:44:34 GMT
Title: Motion Guided Attention Fusion to Recognize Interactions from Videos
Authors: Tae Soo Kim, Jonathan Jones, Gregory D. Hager
Abstract summary: We present a dual-pathway approach for recognizing fine-grained interactions from videos. We fuse the bottom-up features in the motion pathway with features captured from object detections to learn the temporal aspects of an action. We show that our approach can generalize across appearance effectively and recognize actions where an actor interacts with previously unseen objects.
Score: 40.1565059238891
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a dual-pathway approach for recognizing fine-grained interactions from videos. We build on the success of prior dual-stream approaches, but make a distinction between the static and dynamic representations of objects and their interactions explicit by introducing separate motion and object detection pathways. Then, using our new Motion-Guided Attention Fusion module, we fuse the bottom-up features in the motion pathway with features captured from object detections to learn the temporal aspects of an action. We show that our approach can generalize across appearance effectively and recognize actions where an actor interacts with previously unseen objects. We validate our approach using the compositional action recognition task from the Something-Something-v2 dataset where we outperform existing state-of-the-art methods. We also show that our method can generalize well to real world tasks by showing state-of-the-art performance on recognizing humans assembling various IKEA furniture on the IKEA-ASM dataset.

Related papers

Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection [4.938957922033169]
Out-of-distribution (OOD) detection targets to detect and reject test samples with semantic shifts. We propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN) We show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.
arXiv Detail & Related papers (2024-09-16T02:53:49Z)
Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework. It simultaneously performs Detection And Interaction Reasoning in one stage. We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding. Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction. Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
Automatic Interaction and Activity Recognition from Videos of Human Manual Demonstrations with Application to Anomaly Detection [0.0]
This paper exploits Scene Graphs to extract key interaction features from image sequences while simultaneously motion patterns and context. The method introduces event-based automatic video segmentation and clustering, which allow for the grouping of similar events and detect if a monitored activity is executed correctly.
arXiv Detail & Related papers (2023-04-19T16:15:23Z)
SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model. Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data. Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
Learning Visual Affordance Grounding from Demonstration Videos [76.46484684007706]
Affordance grounding aims to segment all possible interaction regions between people and objects from an image/video. We propose a Hand-aided Affordance Grounding Network (HAGNet) that leverages the aided clues provided by the position and action of the hand in demonstration videos.
arXiv Detail & Related papers (2021-08-12T11:45:38Z)
Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues. We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background. We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59)
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
Pose And Joint-Aware Action Recognition [87.4780883700755]
We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder. Our joint selector module re-weights the joint information to select the most discriminative joints for the task. We show large improvements over the current state-of-the-art joint-based approaches on JHMDB, HMDB, Charades, AVA action recognition datasets.
arXiv Detail & Related papers (2020-10-16T04:43:34Z)
Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition [19.93779132095822]
We argue that learning features jointly intertwine from these two information channels is beneficial. We propose a single stream architecture able to do so, thanks to the addition of a self-supervised motion prediction block. Experiments on several publicly available databases show the power of our approach.
arXiv Detail & Related papers (2020-02-10T17:51:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.