Rich Action-semantic Consistent Knowledge for Early Action Prediction
- URL: http://arxiv.org/abs/2201.09169v3
- Date: Wed, 20 Dec 2023 08:58:03 GMT
- Title: Rich Action-semantic Consistent Knowledge for Early Action Prediction
- Authors: Xiaoli Liu, Jianqin Yin, Di Guo, and Huaping Liu
- Abstract summary: Early action prediction (EAP) aims to recognize human actions from a part of action execution in ongoing videos.
We partition original partial or full videos to form a new series of partial videos evolving in arbitrary progress levels.
A novel Rich Action-semantic Consistent Knowledge network (RACK) under the teacher-student framework is proposed for EAP.
- Score: 20.866206453146898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Early action prediction (EAP) aims to recognize human actions from a part of
action execution in ongoing videos, which is an important task for many
practical applications. Most prior works treat partial or full videos as a
whole, ignoring rich action knowledge hidden in videos, i.e., semantic
consistencies among different partial videos. In contrast, we partition
original partial or full videos to form a new series of partial videos and mine
the Action-Semantic Consistent Knowledge (ASCK) among these new partial videos
evolving in arbitrary progress levels. Moreover, a novel Rich Action-semantic
Consistent Knowledge network (RACK) under the teacher-student framework is
proposed for EAP. Firstly, we use a two-stream pre-trained model to extract
features of videos. Secondly, we treat the RGB or flow features of the partial
videos as nodes and their action semantic consistencies as edges. Next, we
build a bi-directional semantic graph for the teacher network and a
single-directional semantic graph for the student network to model rich ASCK
among partial videos. The MSE and MMD losses are incorporated as our
distillation loss to enrich the ASCK of partial videos from the teacher to the
student network. Finally, we obtain the final prediction by summing the
logits of different subnetworks and applying a softmax layer. Extensive
experiments and ablative studies have been conducted, demonstrating the
effectiveness of modeling rich ASCK for EAP. With the proposed RACK, we have
achieved state-of-the-art performance on three benchmarks. The code is
available at https://github.com/lily2lab/RACK.git.
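
For concreteness, a minimal sketch of the partitioning step described in the abstract: one (partial or full) video is cut into a nested series of partial videos at increasing progress levels. The uniform progress levels and the helper name `partition_partial_videos` are illustrative assumptions, not the paper's exact procedure.

```python
def partition_partial_videos(frames, num_levels: int = 10):
    """Cut one (partial or full) video into a nested series of partial videos.

    `frames` is any per-frame sequence (e.g. frame paths or arrays).
    Progress levels are taken as k / num_levels for k = 1..num_levels,
    an illustrative choice; the paper allows arbitrary progress levels.
    """
    total = len(frames)
    partials = []
    for k in range(1, num_levels + 1):
        observed = max(1, round(total * k / num_levels))  # frames seen so far
        partials.append(frames[:observed])
    return partials
```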
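Likewise, a hedged sketch (not the official code; see the linked repository) of how the teacher and student semantic graphs over partial-video features could be wired, assuming attention-style edge weights and hypothetical tensor shapes. The teacher's bi-directional graph lets every progress level attend to every other, while the student's single-directional graph restricts each partial video to itself and earlier progress levels.

```python
import torch
import torch.nn as nn

class SemanticGraphLayer(nn.Module):
    """Toy message passing over partial-video nodes (one stream: RGB or flow).

    Nodes are features of K partial videos at increasing progress levels;
    edge weights (the "action semantic consistencies") are attention scores,
    optionally masked to make the graph single-directional.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # nodes: (B, K, D); mask: (K, K) with 0 for allowed edges, -inf for blocked ones
        scores = self.query(nodes) @ self.key(nodes).transpose(1, 2) / nodes.size(-1) ** 0.5
        edges = torch.softmax(scores + mask, dim=-1)      # normalized edge weights
        return nodes + edges @ self.value(nodes)          # aggregate neighbor messages


def teacher_mask(k: int) -> torch.Tensor:
    # Bi-directional graph: every progress level may attend to every other.
    return torch.zeros(k, k)


def student_mask(k: int) -> torch.Tensor:
    # Single-directional graph: block edges from future (higher) progress levels.
    return torch.triu(torch.full((k, k), float("-inf")), diagonal=1)


# Usage: feats (B, K, D) from the two-stream backbone for the K partial videos.
# teacher_out = layer(feats, teacher_mask(feats.size(1)))
# student_out = layer(feats, student_mask(feats.size(1)))
```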
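Finally, a sketch of the distillation and fusion steps: an MSE term plus an MMD term (a Gaussian-kernel maximum mean discrepancy is used here as one common formulation) pulls the student's node features toward the teacher's, and the final prediction sums the RGB and flow subnetwork logits before a softmax. The loss weights, kernel bandwidth, and function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def mmd_loss(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD with a Gaussian kernel.

    x: (N, D) student node features, y: (M, D) teacher node features.
    """
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()


def distillation_loss(student_feats, teacher_feats, alpha=1.0, beta=1.0):
    # student_feats, teacher_feats: (B, K, D) node features from the two graphs.
    # MSE aligns features element-wise; MMD matches their overall distributions.
    mse = F.mse_loss(student_feats, teacher_feats)
    mmd = mmd_loss(student_feats.flatten(0, 1), teacher_feats.flatten(0, 1))
    return alpha * mse + beta * mmd


def fuse_prediction(rgb_logits: torch.Tensor, flow_logits: torch.Tensor) -> torch.Tensor:
    # Final prediction: sum the subnetwork logits, then softmax over classes.
    return torch.softmax(rgb_logits + flow_logits, dim=-1)
```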
Related papers
- ActionHub: A Large-scale Action Video Description Dataset for Zero-shot
Action Recognition [35.08592533014102]
Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions.
We propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module.
arXiv Detail & Related papers (2024-01-22T02:21:26Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Paxion: Patching Action Knowledge in Video-Language Foundation Models [112.92853632161604]
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.
Despite impressive performance on various benchmark tasks, recent video-language models show a surprising deficiency (near-random performance) in action knowledge.
We propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective.
arXiv Detail & Related papers (2023-05-18T03:53:59Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Br-Prompt achieves state-of-the-art on multiple benchmarks.
arXiv Detail & Related papers (2022-03-26T15:52:27Z)
- Adversarial Memory Networks for Action Prediction [95.09968654228372]
Action prediction aims to infer the forthcoming human action with partially-observed videos.
We propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioning on a partial video query.
arXiv Detail & Related papers (2021-12-18T08:16:21Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the dynamic information of the video but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)