ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings
for Video Action Recognition
- URL: http://arxiv.org/abs/2308.03908v1
- Date: Mon, 7 Aug 2023 20:50:54 GMT
- Title: ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings
for Video Action Recognition
- Authors: Soumyabrata Chaudhuri and Saumik Bhattacharya
- Abstract summary: We present the first pose-augmented Vision-Language Model (VLM) for Video Action Recognition.
Notably, our scheme achieves accuracies of 92.81% and 73.02% on two popular human video action recognition benchmark datasets.
- Score: 4.36572039512405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Action Recognition (VAR) is a challenging task due to its inherent
complexities. Though different approaches have been explored in the literature,
designing a unified framework to recognize a large number of human actions is
still a challenging problem. Recently, Multi-Modal Learning (MML) has
demonstrated promising results in this domain. In the literature, the 2D skeleton
or pose modality has often been used for this task, either independently or in
conjunction with the visual information (RGB modality) present in videos.
However, the combination of pose, visual information, and text attributes has
not yet been explored, even though text and pose attributes have independently
proven effective in numerous computer vision tasks. In this paper, we present
the first pose-augmented Vision-Language Model (VLM) for VAR. Notably, our
scheme achieves accuracies of 92.81% and 73.02% on two popular human video
action recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even
without any video data pre-training, and accuracies of 96.11% and 75.75% after
Kinetics pre-training.
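The abstract describes fusing a pose modality with the vision and language modalities of a VLM, but gives no implementation details here. Below is a minimal, hypothetical PyTorch sketch of one way such a fusion could look for action classification; all module names, dimensions, and the late-fusion design are illustrative assumptions, not the ViLP authors' code.

# Hypothetical sketch: late fusion of vision, language, and pose embeddings
# for video action recognition. Names and dimensions are illustrative
# assumptions, not the ViLP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseAugmentedVLM(nn.Module):
    def __init__(self, vision_dim=512, text_dim=512, pose_dim=256, embed_dim=512):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.pose_proj = nn.Linear(pose_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, as in CLIP-style contrastive matching.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, vision_feats, pose_feats, text_feats):
        # vision_feats: (B, vision_dim) pooled frame features (e.g. from an image encoder)
        # pose_feats:   (B, pose_dim) pooled 2D-skeleton features
        # text_feats:   (C, text_dim) embeddings of C action class-name prompts
        video_emb = self.vision_proj(vision_feats) + self.pose_proj(pose_feats)
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(self.text_proj(text_feats), dim=-1)
        # Similarity between each video and each class prompt -> action logits (B, C).
        return self.logit_scale.exp() * video_emb @ text_emb.t()

In this sketch, recognition reduces to CLIP-style matching of the fused video embedding against text embeddings of the class names; the actual ViLP architecture may differ.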
Related papers
- Revisiting Feature Prediction for Learning Visual Representations from Video [62.08833572467379]
V-JEPA is a collection of vision models trained solely using a feature prediction objective.
The models are trained on 2 million videos collected from public datasets.
Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks.
arXiv Detail & Related papers (2024-02-15T18:59:11Z)
- Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos [2.3247413495885647]
We use 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition.
Our model achieves state-of-the-art results: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2 using the ViT-giant backbone.
arXiv Detail & Related papers (2024-02-14T00:41:10Z)
- AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder [3.8735222804007394]
We propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information.
Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content.
arXiv Detail & Related papers (2023-09-15T19:56:15Z)
- NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
- Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers [1.8047694351309207]
We introduce two strategies for learning pose-aware representations in Vision Transformers (ViTs).
The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos.
The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task.
arXiv Detail & Related papers (2023-06-15T17:58:39Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Exploit Clues from Views: Self-Supervised and Regularized Learning for Multiview Object Recognition [66.87417785210772]
This work investigates the problem of multiview self-supervised learning (MV-SSL).
A novel surrogate task for self-supervised learning is proposed that pursues an "object-invariant" representation.
Experiments show that recognition and retrieval using the view-invariant prototype embedding (VISPE) outperform other self-supervised learning methods.
arXiv Detail & Related papers (2020-03-28T07:06:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.