Are Visual-Language Models Effective in Action Recognition? A Comparative Study
- URL: http://arxiv.org/abs/2410.17149v1
- Date: Tue, 22 Oct 2024 16:28:21 GMT
- Title: Are Visual-Language Models Effective in Action Recognition? A Comparative Study
- Authors: Mahmoud Ali, Di Yang, François Brémond
- Abstract summary: This paper provides a large-scale study of, and insight into, state-of-the-art vision foundation models.
It compares their ability to transfer to zero-shot and frame-wise action recognition tasks.
Experiments are conducted on recent fine-grained, human-centric action recognition datasets.
- Score: 22.97135293252601
- License:
- Abstract: Current vision-language foundation models, such as CLIP, have recently shown significant performance improvements across various downstream tasks. However, whether such foundation models also significantly improve more complex, fine-grained action recognition tasks is still an open question. To answer this question, and to better identify future research directions for in-the-wild human behavior analysis, this paper provides a large-scale study of, and insight into, current state-of-the-art vision foundation models by comparing their ability to transfer to zero-shot and frame-wise action recognition tasks. Extensive experiments are conducted on recent fine-grained, human-centric action recognition datasets (e.g., Toyota Smarthome, Penn Action, UAV-Human, TSU, Charades), covering both action classification and segmentation.
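Since the study centers on zero-shot transfer of CLIP-style models to frame-wise action recognition, a minimal sketch of that general setup is shown below. It assumes the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and a hypothetical list of action labels; it illustrates the common zero-shot protocol, not the paper's exact evaluation pipeline.
```python
# Minimal sketch of zero-shot, frame-wise action classification with CLIP.
# Assumption: checkpoint, prompt template, and action labels are illustrative,
# not the datasets or protocol used in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical fine-grained action labels, wrapped in a text prompt template.
actions = ["drinking from a cup", "reading a book", "using a laptop", "walking"]
prompts = [f"a photo of a person {a}" for a in actions]

def classify_frame(frame: Image.Image) -> str:
    """Score one video frame against all action prompts and return the best match."""
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # shape: (1, num_actions)
    return actions[probs.argmax(dim=-1).item()]

# Frame-wise recognition: classify each sampled frame independently.
# frames = [Image.open(p) for p in ["frame_000.jpg", "frame_001.jpg"]]  # hypothetical paths
# per_frame_predictions = [classify_frame(f) for f in frames]
```
Frame-wise segmentation then reduces to assigning each sampled frame its top-scoring action label, which is the kind of transfer setting the paper evaluates.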
Related papers
- From CNNs to Transformers in Multimodal Human Action Recognition: A Survey [23.674123304219822]
Human action recognition is one of the most widely studied research problems in Computer Vision.
Recent studies have shown that addressing it using multimodal data leads to superior performance.
The recent rise of Transformers in visual modelling is now also causing a paradigm shift in action recognition.
arXiv Detail & Related papers (2024-05-22T02:11:18Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration, and how some lower-scoring models on standard benchmarks perform on par with the best-performing models when trained on the same data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
- Spatial-Temporal Alignment Network for Action Recognition and Detection [80.19235282200697]
This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection.
We propose a novel Spatial-Temporal Alignment Network (STAN) that aims to learn geometric invariant representations for action recognition and action detection.
We test our STAN model extensively on AVA, Kinetics-400, AVA-Kinetics, Charades, and Charades-Ego datasets.
arXiv Detail & Related papers (2020-12-04T06:23:40Z)
- Recent Progress in Appearance-based Action Recognition [73.6405863243707]
Action recognition is the task of identifying various human actions in a video.
Recent appearance-based methods have achieved promising progress towards accurate action recognition.
arXiv Detail & Related papers (2020-11-25T10:18:12Z)
- A Grid-based Representation for Human Action Recognition [12.043574473965318]
Human action recognition (HAR) in videos is a fundamental research topic in computer vision.
We propose a novel method for action recognition that efficiently encodes the most discriminative appearance information of an action.
Our method is tested on several benchmark datasets demonstrating that our model can accurately recognize human actions.
arXiv Detail & Related papers (2020-10-17T18:25:00Z)
- Sensor Data for Human Activity Recognition: Feature Representation and Benchmarking [27.061240686613182]
The field of Human Activity Recognition (HAR) focuses on obtaining and analysing data captured from monitoring devices (e.g., sensors).
We address the issue of accurately recognising human activities using different Machine Learning (ML) techniques.
arXiv Detail & Related papers (2020-05-15T00:46:55Z)