Video Action Recognition with Attentive Semantic Units
- URL: http://arxiv.org/abs/2303.09756v2
- Date: Tue, 10 Oct 2023 13:31:22 GMT
- Title: Video Action Recognition with Attentive Semantic Units
- Authors: Yifei Chen, Dapeng Chen, Ruijin Liu, Hao Li, Wei Peng
- Abstract summary: We exploit the semantic units (SU) hiding behind the action labels for more accurate action recognition.
We introduce a multi-region module (MRA) to the visual branch of the Visual-Language Model (VLM).
In fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400.
- Score: 23.384091957466588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-Language Models (VLMs) have significantly advanced action video
recognition. Supervised by the semantics of action labels, recent works adapt
the visual branch of VLMs to learn video representations. Despite the
effectiveness proved by these works, we believe that the potential of VLMs has
yet to be fully harnessed. In light of this, we exploit the semantic units (SU)
hiding behind the action labels and leverage their correlations with
fine-grained items in frames for more accurate action recognition. SUs are
entities extracted from the language descriptions of the entire action set,
including body parts, objects, scenes, and motions. To further enhance the
alignments between visual contents and the SUs, we introduce a multi-region
module (MRA) to the visual branch of the VLM. The MRA allows the perception of
region-aware visual features beyond the original global feature. Our method
adaptively attends to and selects relevant SUs with visual features of frames.
With a cross-modal decoder, the selected SUs serve to decode spatiotemporal
video representations. In summary, the SUs as the medium can boost
discriminative ability and transferability. Specifically, in fully-supervised
learning, our method achieved 87.8% top-1 accuracy on Kinetics-400. In K=2
few-shot experiments, our method surpassed the previous state-of-the-art by
+7.1% and +15.0% on HMDB-51 and UCF-101, respectively.
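The pipeline described in the abstract (frame features attending over a bank of semantic-unit text embeddings, then a cross-modal decoder turning the selected SUs into a video representation) can be sketched roughly as follows. This is a minimal PyTorch illustration under assumptions, not the authors' implementation: the module names, feature dimensions, the top-k selection, and the use of `nn.MultiheadAttention` as the cross-modal decoder are all assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveSUDecoder(nn.Module):
    """Minimal sketch: frame features attend over semantic-unit (SU) text
    embeddings, and a cross-modal decoder fuses the selected SUs into a
    video-level representation. Dimensions and module choices are assumptions."""

    def __init__(self, dim=512, num_classes=400, num_heads=8, top_k=16):
        super().__init__()
        self.top_k = top_k
        # Cross-modal decoder: video tokens are queries, selected SUs are keys/values.
        self.decoder = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats, su_embeds):
        # frame_feats: (B, T, D) per-frame visual features from the VLM visual branch
        # su_embeds:   (N, D)    text embeddings of the semantic units (body parts,
        #                        objects, scenes, motions) from the label descriptions
        # Adaptively attend to SUs: cosine similarity between frames and every SU.
        sim = F.normalize(frame_feats, dim=-1) @ F.normalize(su_embeds, dim=-1).t()  # (B, T, N)

        # Select the most relevant SUs per video (similarity pooled over frames).
        video_sim = sim.mean(dim=1)                                  # (B, N)
        topk_idx = video_sim.topk(self.top_k, dim=-1).indices
        selected = su_embeds[topk_idx]                               # (B, K, D)

        # Cross-modal decoding: frames query the selected SUs.
        decoded, _ = self.decoder(frame_feats, selected, selected)
        video_repr = self.norm(decoded + frame_feats).mean(dim=1)    # (B, D) temporal pooling
        return self.head(video_repr)


if __name__ == "__main__":
    model = AttentiveSUDecoder()
    frames = torch.randn(2, 8, 512)     # 2 videos, 8 frames, 512-d features
    sus = torch.randn(100, 512)         # 100 semantic-unit text embeddings
    print(model(frames, sus).shape)     # torch.Size([2, 400])
```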
Related papers
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates together in an online fashion features from all the clips for the final class prediction.
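As a rough illustration of the clip-encoder / online-aggregation / prototype design summarized above, the sketch below scores partially observed videos against one learned prototype per class; the GRU aggregator, feature dimensions, and cosine matching are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeEarlyRecognizer(nn.Module):
    """Rough sketch: clips are encoded independently upstream, aggregated online,
    and compared to one learned prototype per class. Details are assumptions."""

    def __init__(self, clip_dim=512, num_classes=51):
        super().__init__()
        # One prototypical representation of the full action per class.
        self.prototypes = nn.Parameter(torch.randn(num_classes, clip_dim))
        # Online aggregator over the clips observed so far (GRU is an assumption).
        self.aggregator = nn.GRU(clip_dim, clip_dim, batch_first=True)

    def forward(self, clip_feats):
        # clip_feats: (B, C, D) features of the clips seen so far, in temporal order.
        agg, _ = self.aggregator(clip_feats)
        partial = agg[:, -1]                 # (B, D) online summary of the partial video
        # Score each class by similarity to its full-action prototype.
        return F.normalize(partial, dim=-1) @ F.normalize(self.prototypes, dim=-1).t()


if __name__ == "__main__":
    model = PrototypeEarlyRecognizer()
    clips = torch.randn(4, 3, 512)      # 4 videos, only 3 clips observed so far
    print(model(clips).shape)           # torch.Size([4, 51])
```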
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
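The CLIP-score-guided sampling idea (rank candidate frames by similarity to the text query rather than sampling uniformly) might look roughly like the following; the plain cosine scoring and top-k selection here are assumptions, not VaQuitA's exact procedure.

```python
import torch
import torch.nn.functional as F


def clip_guided_sample(frame_embeds: torch.Tensor, text_embed: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the k frames whose CLIP embeddings score highest against the text query.
    frame_embeds: (T, D) CLIP image embeddings of candidate frames.
    text_embed:   (D,)   CLIP text embedding of the query/instruction.
    Returns the indices of the selected frames in temporal order."""
    scores = F.normalize(frame_embeds, dim=-1) @ F.normalize(text_embed, dim=0)  # (T,)
    top = scores.topk(k).indices
    return top.sort().values  # keep temporal order for the downstream video model


if __name__ == "__main__":
    frames = torch.randn(64, 512)   # 64 candidate frames
    text = torch.randn(512)
    print(clip_guided_sample(frames, text, k=8))
```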
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition [16.828560953073495]
We propose a novel "Align before Adapt" (ALT) paradigm for video representation learning.
We exploit the entity-to-region alignments for each frame. The alignments are fulfilled by matching the region-aware image embeddings to an offline-constructed text corpus.
ALT demonstrates competitive performance while maintaining remarkably low computational costs.
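The entity-to-region alignment step (matching region-aware image embeddings to an offline-constructed text corpus) can be sketched as a simple similarity lookup; the hard argmax matching and the dimensions below are assumptions.

```python
import torch
import torch.nn.functional as F


def align_regions_to_entities(region_embeds: torch.Tensor, corpus_embeds: torch.Tensor):
    """For each region embedding in a frame, find the best-matching entity
    from an offline-constructed text corpus (cosine similarity, hard match).
    region_embeds: (R, D) region-aware image embeddings of one frame.
    corpus_embeds: (E, D) text embeddings of the entity corpus.
    Returns (indices, scores) of the matched entities."""
    sim = F.normalize(region_embeds, dim=-1) @ F.normalize(corpus_embeds, dim=-1).t()  # (R, E)
    scores, indices = sim.max(dim=-1)
    return indices, scores


if __name__ == "__main__":
    regions = torch.randn(16, 512)      # 16 regions in a frame
    corpus = torch.randn(1000, 512)     # 1000 entity descriptions
    idx, sc = align_regions_to_entities(regions, corpus)
    print(idx.shape, sc.shape)          # torch.Size([16]) torch.Size([16])
```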
arXiv Detail & Related papers (2023-11-27T08:32:28Z)
- MOFO: MOtion FOcused Self-Supervision for Video Understanding [11.641926922266347]
Self-supervised learning techniques have produced outstanding results in learning visual representations from unlabeled videos.
Despite the importance of motion in supervised learning techniques for action recognition, SSL methods often do not explicitly consider motion information in videos.
We propose MOFO, a novel SSL method for focusing representation learning on the motion area of a video, for action recognition.
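One way to read "focusing representation learning on the motion area" is to bias a masked-modeling objective toward patches with large temporal change; the frame-difference-based masking below is only an illustrative assumption, not MOFO's actual procedure.

```python
import torch


def motion_focused_mask(video: torch.Tensor, patch: int = 16, mask_ratio: float = 0.75) -> torch.Tensor:
    """Return a boolean patch mask (True = masked) that prefers high-motion patches.
    video: (T, C, H, W) with H and W divisible by `patch`."""
    T, C, H, W = video.shape
    # Per-pixel motion proxy: absolute difference between consecutive frames.
    motion = (video[1:] - video[:-1]).abs().mean(dim=1)                 # (T-1, H, W)
    # Aggregate the motion proxy over each spatial patch.
    patches = motion.unfold(1, patch, patch).unfold(2, patch, patch)    # (T-1, H/p, W/p, p, p)
    score = patches.mean(dim=(0, 3, 4)).flatten()                       # (H/p * W/p,)
    # Mask the most dynamic patches first.
    num_mask = int(mask_ratio * score.numel())
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[score.topk(num_mask).indices] = True
    return mask.view(H // patch, W // patch)


if __name__ == "__main__":
    print(motion_focused_mask(torch.randn(8, 3, 224, 224)).float().mean())  # ~0.75
```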
arXiv Detail & Related papers (2023-08-23T22:03:57Z)
- HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
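A loose sketch of a homography-equivariance objective: the embedding of one view should be reachable from the embedding of a neighboring view via a transform conditioned on the 3x3 homography between them. The MLP parameterization and the MSE loss below are assumptions, not HomE's actual formulation.

```python
import torch
import torch.nn as nn


class HomographyEquivarianceLoss(nn.Module):
    """Rough sketch: predict view B's embedding from view A's embedding and the
    homography relating the two views, then penalize the prediction error."""

    def __init__(self, dim=256):
        super().__init__()
        # Transform conditioned on the flattened 3x3 homography (an assumption).
        self.transform = nn.Sequential(
            nn.Linear(dim + 9, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, z_a, z_b, homography):
        # z_a, z_b: (B, D) embeddings of two neighboring views of the same scene.
        # homography: (B, 3, 3) mapping view A's image plane to view B's.
        pred_b = self.transform(torch.cat([z_a, homography.flatten(1)], dim=-1))
        return (pred_b - z_b).pow(2).mean()


if __name__ == "__main__":
    loss_fn = HomographyEquivarianceLoss()
    print(loss_fn(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 3, 3)).item())
```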
arXiv Detail & Related papers (2023-06-02T15:37:43Z)
- Paxion: Patching Action Knowledge in Video-Language Foundation Models [112.92853632161604]
Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.
Despite their impressive performance on various benchmark tasks, recent video-language models show a surprising deficiency (near-random performance) in action knowledge.
We propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective.
arXiv Detail & Related papers (2023-05-18T03:53:59Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
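The parameter-free Temporal Concept Spotting idea (use text-to-video similarity to weight frames by temporal saliency) could be approximated as below; the softmax temperature and the weighted mean pooling are assumptions.

```python
import torch
import torch.nn.functional as F


def temporal_saliency_pool(frame_embeds: torch.Tensor, text_embed: torch.Tensor, tau: float = 0.07):
    """Parameter-free temporal weighting: frames more similar to the category
    text receive larger weights before pooling.
    frame_embeds: (T, D) per-frame visual embeddings.
    text_embed:   (D,)   category text embedding."""
    sim = F.normalize(frame_embeds, dim=-1) @ F.normalize(text_embed, dim=0)  # (T,)
    weights = (sim / tau).softmax(dim=0)                                      # temporal saliency
    return (weights.unsqueeze(-1) * frame_embeds).sum(dim=0)                  # (D,) video embedding


if __name__ == "__main__":
    video = temporal_saliency_pool(torch.randn(16, 512), torch.randn(512))
    print(video.shape)  # torch.Size([512])
```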
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- MoDist: Motion Distillation for Self-supervised Video Representation Learning [27.05772951598066]
MoDist is a novel method to distill motion information into self-supervised video representations.
We show that MoDist focuses more on foreground motion regions and thus generalizes better to downstream tasks.
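One way to read "distilling motion information into the representation" is a loss that pulls an RGB encoder's clip features toward those of a motion (e.g., optical-flow) encoder; the cosine-distance objective below is an illustrative assumption, not MoDist's actual training objective.

```python
import torch
import torch.nn.functional as F


def motion_distillation_loss(rgb_feats: torch.Tensor, motion_feats: torch.Tensor) -> torch.Tensor:
    """Pull RGB clip features toward the motion encoder's features.
    rgb_feats, motion_feats: (B, D). The motion branch acts as the teacher."""
    rgb = F.normalize(rgb_feats, dim=-1)
    mot = F.normalize(motion_feats.detach(), dim=-1)   # teacher is not updated
    return (1.0 - (rgb * mot).sum(dim=-1)).mean()      # mean cosine distance


if __name__ == "__main__":
    loss = motion_distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
    print(loss.item())
```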
arXiv Detail & Related papers (2021-06-17T17:57:11Z)
- Video Representation Learning with Visual Tempo Consistency [105.20094164316836]
We show that visual tempo can serve as a self-supervision signal for video representation learning.
We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning.
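Maximizing mutual information between slow- and fast-tempo representations of the same video is commonly implemented with an InfoNCE-style contrastive loss; the sketch below shows that generic idea (temperature and single-direction loss are assumptions, and the paper's hierarchical scheme is not reproduced).

```python
import torch
import torch.nn.functional as F


def tempo_infonce(slow_feats: torch.Tensor, fast_feats: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE between slow- and fast-tempo views of the same videos.
    slow_feats, fast_feats: (B, D); row i of each comes from the same video."""
    slow = F.normalize(slow_feats, dim=-1)
    fast = F.normalize(fast_feats, dim=-1)
    logits = slow @ fast.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(slow.size(0))               # positives on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    print(tempo_infonce(torch.randn(16, 128), torch.randn(16, 128)).item())
```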
arXiv Detail & Related papers (2020-06-28T02:46:44Z)