Improved Actor Relation Graph based Group Activity Recognition
- URL: http://arxiv.org/abs/2010.12968v2
- Date: Tue, 29 Dec 2020 16:56:54 GMT
- Title: Improved Actor Relation Graph based Group Activity Recognition
- Authors: Zijian Kuang and Xinran Tie
- Abstract summary: The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method that mainly focused on group activity recognition by learning the pair-wise actor appearance similarity and actor positions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video understanding is to recognize and classify different actions or
activities appearing in the video. A lot of previous work, such as video
captioning, has shown promising performance in producing general video
understanding. However, it is still challenging to generate a fine-grained
description of human actions and their interactions using state-of-the-art
video captioning techniques. The detailed description of human actions and
group activities is essential information, which can be used in real-time CCTV
video surveillance, health care, sports video analysis, etc. This study
proposes a video understanding method that mainly focused on group activity
recognition by learning the pair-wise actor appearance similarity and actor
positions. We propose to use Normalized cross-correlation (NCC) and the sum of
absolute differences (SAD) to calculate the pair-wise appearance similarity and
build the actor relationship graph to allow the graph convolution network to
learn how to classify group activities. We also propose to use MobileNet as the
backbone to extract features from each video frame. A visualization model is
further introduced to visualize each input video frame with predicted bounding
boxes on each human object and predict individual action and collective
activity.
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relation are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z) - Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention [10.149523817328921]
We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input.
Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence to fulfill this intention.
Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy for 18-class intention recognition.
arXiv Detail & Related papers (2024-04-10T21:03:23Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - Bridge-Prompt: Towards Ordinal Action Understanding in Instructional
Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Br-Prompt achieves state-of-the-art on multiple benchmarks.
arXiv Detail & Related papers (2022-03-26T15:52:27Z) - Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance and outperforms existing methods over a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z) - COMPOSER: Compositional Learning of Group Activity in Videos [33.526331969279106]
Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip.
We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale.
COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.
arXiv Detail & Related papers (2021-12-11T01:25:46Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized-temporal kernels in 3 convolutional neural networks (CNNDs) can be improved to better deal with temporal variations in the input.
We study how we can better handle between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learn robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Action Graphs: Weakly-supervised Action Localization with Graph
Convolution Networks [25.342482374259017]
We present a method for weakly-supervised action localization based on graph convolutions.
Our method utilizes similarity graphs that encode appearance and motion, and pushes the state of the art on THUMOS '14, ActivityNet 1.2, and Charades for weakly supervised action localization.
arXiv Detail & Related papers (2020-02-04T18:21:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.