Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition
- URL: http://arxiv.org/abs/2002.03982v2
- Date: Mon, 7 Dec 2020 18:50:33 GMT
- Title: Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition
- Authors: Mirco Planamente, Andrea Bottino, Barbara Caputo
- Abstract summary: We argue that learning features jointly from these two information channels is beneficial.
We propose a single stream architecture able to do so, thanks to the addition of a self-supervised motion prediction block.
Experiments on several publicly available databases show the power of our approach.
- Score: 19.93779132095822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wearable cameras are becoming more and more popular in several applications,
increasing the interest of the research community in developing approaches for
recognizing actions from the first-person point of view. An open challenge in
egocentric action recognition is that videos lack detailed information about
the main actor's pose and thus tend to record only parts of the movement when
focusing on manipulation tasks. The amount of information about the action
itself is therefore limited, making the understanding of the manipulated
objects and their context crucial. Many previous works addressed this issue with
two-stream architectures, where one stream is dedicated to modeling the
appearance of objects involved in the action, and another to extracting motion
features from optical flow. In this paper, we argue that learning features
jointly from these two information channels is beneficial, as it better captures
the spatio-temporal correlations between the two. To this end, we propose a
single stream architecture able to do so, thanks to the addition of a
self-supervised block that uses a pretext motion prediction task to intertwine
motion and appearance knowledge. Experiments on several publicly available
databases show the power of our approach.
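The core idea, one shared RGB encoder whose features feed both an action classifier and an auxiliary self-supervised motion-prediction head, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under assumed names and hyperparameters (SingleStreamActionNet, the toy encoder, the 2-channel flow head, and the 0.5 loss weight are all ours for exposition); it is not the architecture used in the paper.

import torch
import torch.nn as nn

class SingleStreamActionNet(nn.Module):
    # Illustrative single-stream network: one RGB encoder feeds both an
    # action classifier and a self-supervised motion-prediction head.
    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        # Shared appearance encoder over RGB frames (toy CNN, assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Main task: action classification from pooled features.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, num_classes)
        )
        # Pretext task: regress a 2-channel optical-flow map from the same
        # features, so the shared encoder must also encode motion.
        self.motion_head = nn.Conv2d(feat_dim, 2, kernel_size=3, padding=1)

    def forward(self, rgb: torch.Tensor):
        feats = self.encoder(rgb)
        return self.classifier(feats), self.motion_head(feats)

model = SingleStreamActionNet(num_classes=10)
rgb = torch.randn(4, 3, 112, 112)          # batch of RGB frames
flow_target = torch.randn(4, 2, 112, 112)  # precomputed flow as pretext label
labels = torch.randint(0, 10, (4,))

logits, flow_pred = model(rgb)
# Joint objective: classification loss plus weighted motion-prediction loss.
loss = (nn.functional.cross_entropy(logits, labels)
        + 0.5 * nn.functional.mse_loss(flow_pred, flow_target))
loss.backward()

Because both heads share one encoder, gradients from the flow-regression pretext loss push motion information into the same features the classifier reads; this is the intuition for replacing a separate optical-flow stream with a self-supervised block inside a single stream.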
Related papers
- Joint-Motion Mutual Learning for Pose Estimation in Videos [21.77871402339573]
Human pose estimation in videos has long been a compelling yet challenging task within the realm of computer vision.
Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation.
We propose a novel joint-motion mutual learning framework for pose estimation.
arXiv Detail & Related papers (2024-08-05T07:37:55Z)
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention.
Our approach sets the new state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2023-06-13T06:56:09Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better distinguish between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting [53.28477676794658]
Unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
- Motion Guided Attention Fusion to Recognize Interactions from Videos [40.1565059238891]
We present a dual-pathway approach for recognizing fine-grained interactions from videos.
We fuse the bottom-up features in the motion pathway with features captured from object detections to learn the temporal aspects of an action.
We show that our approach can generalize across appearance effectively and recognize actions where an actor interacts with previously unseen objects.
arXiv Detail & Related papers (2021-04-01T17:44:34Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.