DVANet: Disentangling View and Action Features for Multi-View Action
Recognition
- URL: http://arxiv.org/abs/2312.05719v1
- Date: Sun, 10 Dec 2023 01:19:48 GMT
- Title: DVANet: Disentangling View and Action Features for Multi-View Action
Recognition
- Authors: Nyle Siddiqui, Praveen Tirupattur, Mubarak Shah
- Abstract summary: We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and method of training significantly outperform all other uni-modal models on four multi-view action recognition datasets.
- Score: 56.283944756315066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present a novel approach to multi-view action recognition
where we guide learned action representations to be separated from
view-relevant information in a video. When trying to classify action instances
captured from multiple viewpoints, there is a higher degree of difficulty due
to the difference in background, occlusion, and visibility of the captured
action from different camera angles. To tackle the various problems introduced
in multi-view action recognition, we propose a novel configuration of learnable
transformer decoder queries, in conjunction with two supervised contrastive
losses, to enforce the learning of action features that are robust to shifts in
viewpoints. Our disentangled feature learning occurs in two stages: the
transformer decoder uses separate queries to separately learn action and view
information, which are then further disentangled using our two contrastive
losses. We show that our model and method of training significantly outperform
all other uni-modal models on four multi-view action recognition datasets: NTU
RGB+D, NTU RGB+D 120, PKU-MMD, and N-UCLA. Compared to previous RGB works, we
see maximal improvements of 1.5%, 4.8%, 2.2%, and 4.8% on each dataset,
respectively.
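To make the two-stage disentanglement concrete, here is a minimal sketch of the core idea: a transformer decoder with separate learnable queries for action and view, each branch trained with a supervised contrastive loss (action labels for one, camera/view labels for the other). This is an illustration under stated assumptions, not the authors' implementation; the names (DisentangledDecoder, sup_con_loss), dimensions, and training details are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sup_con_loss(feats, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al., 2020): samples sharing a
    label are pulled together, all other samples in the batch are pushed apart."""
    feats = F.normalize(feats, dim=1)                                   # (B, D)
    sim = feats @ feats.t() / temperature                               # (B, B)
    not_self = ~torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos_mask = (labels[:, None] == labels[None, :]) & not_self
    # log-probability of each other sample, normalized over all non-self samples
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, float('-inf')),
                                     dim=1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()


class DisentangledDecoder(nn.Module):
    """Transformer decoder with separate learnable queries for action and view."""
    def __init__(self, dim=256, n_heads=8, n_layers=2):
        super().__init__()
        self.action_query = nn.Parameter(torch.randn(1, 1, dim))
        self.view_query = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, video_tokens):                 # video_tokens: (B, T, dim)
        B = video_tokens.size(0)
        queries = torch.cat([self.action_query, self.view_query], dim=1)
        out = self.decoder(queries.expand(B, -1, -1), video_tokens)     # (B, 2, dim)
        return out[:, 0], out[:, 1]                  # action feature, view feature


# Toy usage: clips of the same action attract in the action branch regardless
# of camera, while clips from the same camera attract in the view branch.
tokens = torch.randn(8, 16, 256)                     # backbone features (assumed shape)
action_labels = torch.randint(0, 5, (8,))
view_labels = torch.randint(0, 3, (8,))
action_feat, view_feat = DisentangledDecoder()(tokens)
loss = sup_con_loss(action_feat, action_labels) + sup_con_loss(view_feat, view_labels)
```

In the paper's framing, a classification head on the action feature would provide the recognition logits; how the two contrastive terms are weighted against the classification loss is not specified here and would follow the authors' training recipe.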
Related papers
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve efficiency for spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Monocular Dynamic View Synthesis: A Reality Check [45.438135525140154]
We show a discrepancy between the practical capture process and the existing experimental protocols.
We define effective multi-view factors (EMFs) to quantify the amount of multi-view signal present in the input capture sequence.
We also propose a new iPhone dataset that includes more diverse real-life deformation sequences.
arXiv Detail & Related papers (2022-10-24T17:58:28Z)
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of learning robust feature representations that generalize well across multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with two new loss terms, namely an informative loss and a projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
This paper proposes the Motion-Contrastive Perception Network (MCPNet), which consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances, learning visual representations by discriminating instances from each other (see the sketch after this list).
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both.
For this task, we propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC).
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches while using orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
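As referenced in the ASCNet entry above, the clip-level instance-discrimination objective used by many self-supervised video methods can be written as an InfoNCE loss over two augmented clips of the same video. The sketch below is a generic illustration of that objective under assumed names and shapes, not code from any of the listed papers.

```python
import torch
import torch.nn.functional as F


def info_nce(z1, z2, temperature=0.07):
    """InfoNCE: z1[i] and z2[i] are embeddings of two augmentations of clip i;
    every other clip in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (B, B); diagonal = positive pairs
    targets = torch.arange(len(z1), device=z1.device)
    return F.cross_entropy(logits, targets)


# Toy usage with embeddings from any video backbone projected to 128-D.
z_a, z_b = torch.randn(16, 128), torch.randn(16, 128)
loss = info_nce(z_a, z_b)
```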
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.