Adversarial Memory Networks for Action Prediction
- URL: http://arxiv.org/abs/2112.09875v1
- Date: Sat, 18 Dec 2021 08:16:21 GMT
- Title: Adversarial Memory Networks for Action Prediction
- Authors: Zhiqiang Tao, Yue Bai, Handong Zhao, Sheng Li, Yu Kong, Yun Fu
- Abstract summary: Action prediction aims to infer the forthcoming human action with partially-observed videos.
We propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioned on a partial video query.
- Score: 95.09968654228372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action prediction aims to infer the forthcoming human action with
partially-observed videos, which is a challenging task due to the limited
information underlying early observations. Existing methods mainly adopt a
reconstruction strategy to handle this task, expecting to learn a single
mapping function from partial observations to full videos to facilitate the
prediction process. In this study, we propose adversarial memory networks
(AMemNet) to generate the "full video" feature conditioned on a partial video
query from two new aspects. Firstly, a key-value structured memory generator is
designed to memorize different partial videos as key memories and dynamically
write full videos into value memories with a gating mechanism and querying
attention. Secondly, we develop a class-aware discriminator to guide the memory
generator to deliver not only realistic but also discriminative full video
features through adversarial training. The final prediction of AMemNet is
given by late fusion over RGB and optical flow streams. Extensive experimental
results on two benchmark video datasets, UCF-101 and HMDB51, are provided to
demonstrate the effectiveness of the proposed AMemNet model over
state-of-the-art methods.
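For a concrete feel of the pipeline described above, here is a minimal, hypothetical PyTorch sketch: a key-value memory read that addresses learnable key slots with a partial-video query and gates the retrieved value memory into a "full video" feature, plus a two-head, class-aware discriminator trained adversarially against it, with late fusion of stream scores at the end. All module names, slot counts, feature sizes, and loss weights are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeyValueMemoryGenerator(nn.Module):
    """Key slots memorize partial-video patterns; value slots hold full-video features."""

    def __init__(self, feat_dim=2048, num_slots=128):
        super().__init__()
        self.keys = nn.Parameter(0.02 * torch.randn(num_slots, feat_dim))
        self.values = nn.Parameter(0.02 * torch.randn(num_slots, feat_dim))
        self.gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, partial_feat):
        # Querying attention: address the key slots with the partial-video feature.
        attn = F.softmax(partial_feat @ self.keys.t() / self.keys.size(1) ** 0.5, dim=-1)
        read = attn @ self.values  # retrieved full-video memory
        # Gating: blend the retrieved memory with the query feature.
        g = torch.sigmoid(self.gate(torch.cat([partial_feat, read], dim=-1)))
        return g * read + (1.0 - g) * partial_feat  # generated "full video" feature


class ClassAwareDiscriminator(nn.Module):
    """Two heads: a real/fake realism score and an action-class score."""

    def __init__(self, feat_dim=2048, num_classes=101):
        super().__init__()
        self.real_fake = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        return self.real_fake(feat), self.classifier(feat)


# Toy usage for one stream (e.g., RGB); a flow stream would be trained the same way.
gen, disc = KeyValueMemoryGenerator(), ClassAwareDiscriminator()
partial = torch.randn(4, 2048)                  # features of partially observed videos
full = torch.randn(4, 2048)                     # features of the corresponding full videos
labels = torch.randint(0, 101, (4,))

fake = gen(partial)
rf_fake, cls_fake = disc(fake)
rf_real, cls_real = disc(full)

# Discriminator loss: real vs. generated features, plus class supervision on real ones.
d_loss = (F.binary_cross_entropy_with_logits(rf_real, torch.ones_like(rf_real))
          + F.binary_cross_entropy_with_logits(rf_fake.detach(), torch.zeros_like(rf_fake))
          + F.cross_entropy(cls_real, labels))
# Generator loss: fool the realism head and be classified as the correct action.
g_loss = (F.binary_cross_entropy_with_logits(rf_fake, torch.ones_like(rf_fake))
          + F.cross_entropy(cls_fake, labels))

# Late fusion at test time (hypothetical): average class scores of the two streams;
# the flow-stream scores are mocked here with a copy of the RGB scores.
rgb_scores = cls_fake.softmax(dim=-1)
flow_scores = rgb_scores.clone()
prediction = (rgb_scores + flow_scores).argmax(dim=-1)
```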
Related papers
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the clip features in an online fashion for the final class prediction.
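As a rough illustration of this clip-wise online aggregation, here is a short sketch using a recurrent aggregator; the actual encoder, feature sizes, and the prototype learning of the model are not reproduced and are assumptions here.
```python
import torch
import torch.nn as nn


class OnlineClipAggregator(nn.Module):
    def __init__(self, clip_dim=512, hidden=256, num_classes=101):
        super().__init__()
        self.rnn = nn.GRUCell(clip_dim, hidden)   # aggregates clip features one at a time
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip_feats):                # (num_clips, batch, clip_dim)
        h = clip_feats.new_zeros(clip_feats.size(1), self.rnn.hidden_size)
        logits = []
        for clip in clip_feats:                   # online: a prediction after every clip
            h = self.rnn(clip, h)
            logits.append(self.head(h))
        return torch.stack(logits)                # per-step class predictions

clips = torch.randn(8, 2, 512)                    # 8 clips, batch of 2 partially observed videos
preds = OnlineClipAggregator()(clips)             # (8, 2, 101)
```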
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Rich Action-semantic Consistent Knowledge for Early Action Prediction [20.866206453146898]
Early action prediction (EAP) aims to recognize human actions from a part of action execution in ongoing videos.
We partition original partial or full videos to form a new series of partial videos evolving at arbitrary progress levels.
A novel Rich Action-semantic Consistent Knowledge network (RACK) under the teacher-student framework is proposed for EAP.
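A hedged sketch of the teacher-student idea: a teacher network encodes the full video, a student sees only a partial observation, and the student is trained to match the teacher's representation while predicting the action. The encoders, loss terms, and progress levels below are illustrative assumptions, not RACK's actual design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_classes = 512, 51
teacher = nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU())   # sees full videos
student = nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU())   # sees partial videos
classifier = nn.Linear(feat_dim, num_classes)

full_feat = torch.randn(4, 2048)        # pooled features of full videos
partial_feat = torch.randn(4, 2048)     # same videos observed up to some progress level
labels = torch.randint(0, num_classes, (4,))

with torch.no_grad():
    t = teacher(full_feat)              # teacher target, kept fixed for the student update
s = student(partial_feat)
loss = F.mse_loss(s, t) + F.cross_entropy(classifier(s), labels)
```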
arXiv Detail & Related papers (2022-01-23T03:39:31Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
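As a toy example of enforcing consistency between positive samples, the snippet below applies an InfoNCE-style loss to two embeddings of the same clips; how ASCNet actually constructs appearance- and speed-consistent positives is not reproduced here, and the shapes are assumptions.
```python
import torch
import torch.nn.functional as F


def consistency_infonce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two positive views of the same clips."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # similarities to all clips in the batch
    targets = torch.arange(z1.size(0))            # the matching view is the positive
    return F.cross_entropy(logits, targets)

loss = consistency_infonce(torch.randn(16, 128), torch.randn(16, 128))
```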
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network [32.90753137435032]
We propose a convolutional attentive adversarial network (CAAN) to build a deep summarizer in an unsupervised way.
Specifically, the generator employs a fully convolutional sequence network to extract a global representation of a video, and an attention-based network to output normalized importance scores.
The results show the superiority of our proposed method over other state-of-the-art unsupervised approaches.
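A minimal sketch of an attention-based scorer producing normalized per-frame importance scores, in the spirit of the summarizer described above; the layer sizes and the scoring head are assumptions rather than CAAN's actual architecture.
```python
import torch
import torch.nn as nn


class AttentiveScorer(nn.Module):
    def __init__(self, feat_dim=1024, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):                       # (batch, num_frames, feat_dim)
        ctx, _ = self.attn(frames, frames, frames)   # frames attend to the whole video
        return torch.sigmoid(self.score(ctx)).squeeze(-1)  # importance scores in [0, 1]

scores = AttentiveScorer()(torch.randn(2, 120, 1024))       # (2, 120)
```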
arXiv Detail & Related papers (2021-05-24T07:24:39Z)
- Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval [155.32369959647437]
Cross-modal video-text retrieval is a challenging task in the field of vision and language.
Existing approaches for this task all focus on how to design an encoding model trained with a hard negative ranking loss.
We propose a novel memory enhanced embedding learning (MEEL) method for video-text retrieval.
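For reference, the hard-negative ranking loss the summary alludes to can be sketched as a max-of-hinges objective over matched video-text pairs; the margin and shapes below are assumptions, and MEEL's memory mechanism itself is not shown.
```python
import torch


def hard_negative_ranking_loss(video_emb, text_emb, margin=0.2):
    """video_emb, text_emb: (batch, dim); row i of each is a matching pair."""
    sim = video_emb @ text_emb.t()                         # pairwise similarities
    pos = sim.diag().unsqueeze(1)                          # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    # For each video, penalize the hardest negative text, and vice versa.
    cost_v2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0).max(dim=1).values
    cost_t2v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0).max(dim=0).values
    return (cost_v2t + cost_t2v).mean()

loss = hard_negative_ranking_loss(torch.randn(8, 256), torch.randn(8, 256))
```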
arXiv Detail & Related papers (2021-03-29T15:15:09Z)
- Video SemNet: Memory-Augmented Video Semantic Network [14.64546899992196]
We propose a machine learning approach to capture the narrative elements in movies by bridging the gap between the low-level data representations and semantic aspects of the visual medium.
We present a Memory-Augmented Video Semantic Network, called Video SemNet, to encode the semantic descriptors and learn an embedding for the video.
We demonstrate that our model is able to predict genres and IMDB ratings with weighted F-1 scores of 0.72 and 0.63, respectively.
arXiv Detail & Related papers (2020-11-22T01:36:37Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- SummaryNet: A Multi-Stage Deep Learning Model for Automatic Video Summarisation [0.0]
We introduce SummaryNet as a supervised learning framework for automated video summarisation.
It employs a two-stream convolutional network to learn spatial (appearance) and temporal (motion) representations.
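A rough sketch of the two-stream idea, with one convolutional branch for appearance (RGB frames) and one for motion (stacked optical flow), whose features are concatenated; the backbones and input shapes are assumptions rather than SummaryNet's actual design.
```python
import torch
import torch.nn as nn

spatial = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())     # appearance stream
temporal = nn.Sequential(nn.Conv2d(10, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())    # motion stream (5 flow pairs)

rgb = torch.randn(2, 3, 112, 112)        # an RGB frame per sample
flow = torch.randn(2, 10, 112, 112)      # stacked x/y optical-flow fields
feature = torch.cat([spatial(rgb), temporal(flow)], dim=-1)        # (2, 128) joint representation
```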
arXiv Detail & Related papers (2020-02-19T18:24:35Z)