Automatic Interaction and Activity Recognition from Videos of Human
Manual Demonstrations with Application to Anomaly Detection
- URL: http://arxiv.org/abs/2304.09789v2
- Date: Fri, 7 Jul 2023 08:31:03 GMT
- Title: Automatic Interaction and Activity Recognition from Videos of Human
Manual Demonstrations with Application to Anomaly Detection
- Authors: Elena Merlo (1, 2), Marta Lagomarsino (1, 3), Edoardo Lamon (1, 4),
Arash Ajoudani (1) ((1) Human-Robot Interfaces and Interaction Laboratory,
Istituto Italiano di Tecnologia, Genoa, Italy, (2) Dept. of Informatics,
Bioengineering, Robotics, and Systems Engineering, University of Genoa,
Genoa, Italy, (3) Dept. of Electronics, Information and Bioengineering,
Politecnico di Milano, Milan, Italy, (4) Dept. of Information Engineering and
Computer Science, University of Trento, Trento, Italy)
- Abstract summary: This paper exploits Scene Graphs to extract key interaction features from image sequences while simultaneously encoding motion patterns and context.
The method introduces event-based automatic video segmentation and clustering, which allow for the grouping of similar events and the detection of whether a monitored activity is executed correctly.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a new method to describe spatio-temporal relations
between objects and hands, to recognize both interactions and activities within
video demonstrations of manual tasks. The approach exploits Scene Graphs to
extract key interaction features from image sequences while simultaneously
encoding motion patterns and context. Additionally, the method introduces
event-based automatic video segmentation and clustering, which allow for the
grouping of similar events and the detection of whether a monitored activity is
executed correctly. The effectiveness of the approach was demonstrated in two
multi-subject experiments, showing the ability to recognize and cluster
hand-object and object-object interactions without prior knowledge of the
activity, as well as to match the same activity performed by different
subjects.
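
The pipeline described in the abstract lends itself to a compact illustration. The following Python sketch is a minimal, speculative reconstruction of the three stages (per-frame scene-graph extraction, event-based segmentation, and event clustering). It is not the authors' implementation: the bounding-box contact test stands in for the paper's scene-graph features, and all names (`Detection`, `scene_graph`, `segment_events`, `cluster_events`, `is_anomalous`) are assumptions made for illustration.

```python
# Speculative sketch of the paper's pipeline; names and the box-contact
# heuristic are invented for illustration, not taken from the authors' code.
from dataclasses import dataclass
from itertools import combinations

# --- 1. Per-frame scene graph from 2D detections ------------------------

@dataclass
class Detection:
    label: str   # e.g. "hand", "cup", "screwdriver"
    cx: float    # bounding-box centre, x
    cy: float    # bounding-box centre, y
    w: float
    h: float

def boxes_touch(a: Detection, b: Detection) -> bool:
    """Coarse contact test: do the two bounding boxes overlap?"""
    return (abs(a.cx - b.cx) * 2 < a.w + b.w and
            abs(a.cy - b.cy) * 2 < a.h + b.h)

def scene_graph(dets: list[Detection]) -> frozenset[tuple[str, str]]:
    """Edges = pairs of entities currently in contact (hand-object or
    object-object). A frozenset makes graphs hashable and comparable."""
    edges = set()
    for a, b in combinations(dets, 2):
        if boxes_touch(a, b):
            edges.add((min(a.label, b.label), max(a.label, b.label)))
    return frozenset(edges)

# --- 2. Event-based segmentation ----------------------------------------
# An "event" starts whenever the set of contact edges changes; each
# segment is summarised by its edge set and its frame range.

def segment_events(graphs):
    if not graphs:
        return []
    events, start = [], 0
    for t in range(1, len(graphs)):
        if graphs[t] != graphs[t - 1]:
            events.append((graphs[start], start, t))
            start = t
    events.append((graphs[start], start, len(graphs)))
    return events

# --- 3. Clustering events across demonstrations --------------------------

def cluster_events(events):
    """Group events that share the same interaction edges."""
    clusters: dict[frozenset, list] = {}
    for graph, t0, t1 in events:
        clusters.setdefault(graph, []).append((t0, t1))
    return clusters

def is_anomalous(events, reference_clusters) -> bool:
    """Flag a demonstration containing an interaction never observed
    in the reference executions of the activity."""
    return any(graph not in reference_clusters for graph, _, _ in events)
```

Representing each event by its frozen set of contact edges makes clustering a simple grouping operation and anomaly checking a set-membership test, which mirrors the paper's claim that similar events can be grouped without prior knowledge of the activity.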
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Multi-Task Learning based Video Anomaly Detection with Attention [1.2944868613449219]
We propose a novel multi-task learning based method that combines complementary proxy tasks to better consider the motion and appearance features.
We combine the semantic segmentation and future frame prediction tasks in a single branch to learn the object class and consistent motion patterns.
In the second branch, we add several attention mechanisms to detect motion anomalies with attention to object parts, the direction of motion, and the distance of the objects from the camera.
arXiv Detail & Related papers (2022-10-14T10:40:20Z)
- Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
arXiv Detail & Related papers (2022-03-27T08:15:20Z)
- Hand-Object Interaction Reasoning [33.612083150296364]
We show that modelling two-handed interactions is critical for action recognition in egocentric video.
We propose an interaction reasoning network for modelling spatio-temporal relationships between hands and objects in video.
arXiv Detail & Related papers (2022-01-13T11:53:12Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance compared with state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- The Object at Hand: Automated Editing for Mixed Reality Video Guidance from Hand-Object Interactions [24.68535915849555]
We use egocentric vision to observe hand-object interactions in real-world tasks and automatically decompose a video into its constituent steps.
Our approach combines hand-object interaction (HOI) detection, object similarity measurement, and a finite state machine (FSM) representation to automatically edit videos into steps (a minimal FSM sketch follows at the end of this list).
arXiv Detail & Related papers (2021-09-29T22:24:25Z)
- Motion Guided Attention Fusion to Recognize Interactions from Videos [40.1565059238891]
We present a dual-pathway approach for recognizing fine-grained interactions from videos.
We fuse the bottom-up features in the motion pathway with features captured from object detections to learn the temporal aspects of an action.
We show that our approach can generalize across appearance effectively and recognize actions where an actor interacts with previously unseen objects.
arXiv Detail & Related papers (2021-04-01T17:44:34Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Improved Actor Relation Graph based Group Activity Recognition [0.0]
Detailed descriptions of human actions and group activities provide essential information for real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method that mainly focuses on group activity recognition by learning pair-wise actor appearance similarity and actor positions.
arXiv Detail & Related papers (2020-10-24T19:46:49Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
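
As flagged in the entry for "The Object at Hand" above, a finite state machine over hand-object interaction events suffices to decompose a demonstration video into steps. The sketch below is a speculative Python illustration, not the authors' implementation; the event labels ("grasp"/"release") and the `StepFSM` class are invented for this example.

```python
# Speculative FSM that segments a task video into steps from detected
# hand-object interaction (HOI) events. Names are invented for illustration.

class StepFSM:
    """Advance one step each time the hand releases the current object
    or grasps a different one."""

    def __init__(self):
        self.current_object = None
        self.steps = []      # list of (object, start_frame, end_frame)
        self._start = None

    def observe(self, frame: int, event: str, obj: str):
        if event == "grasp" and obj != self.current_object:
            # Close the previous step, if any, and open a new one.
            if self.current_object is not None:
                self.steps.append((self.current_object, self._start, frame))
            self.current_object, self._start = obj, frame
        elif event == "release" and obj == self.current_object:
            self.steps.append((obj, self._start, frame))
            self.current_object = None

# Usage: feed detected HOI events in frame order.
fsm = StepFSM()
for frame, event, obj in [(10, "grasp", "screw"), (55, "release", "screw"),
                          (60, "grasp", "driver"), (140, "release", "driver")]:
    fsm.observe(frame, event, obj)
print(fsm.steps)   # [('screw', 10, 55), ('driver', 60, 140)]
```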