Actor-agnostic Multi-label Action Recognition with Multi-modal Query
- URL: http://arxiv.org/abs/2307.10763v3
- Date: Wed, 10 Jan 2024 12:18:40 GMT
- Title: Actor-agnostic Multi-label Action Recognition with Multi-modal Query
- Authors: Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan
Dutta
- Abstract summary: Existing action recognition methods are typically actor-specific.
This requires actor-specific pose estimation (e.g., humans vs. animals).
We propose a new approach called 'actor-agnostic multi-modal multi-label action recognition'.
- Score: 42.38571663534819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing action recognition methods are typically actor-specific due to the
intrinsic topological and apparent differences among the actors. This requires
actor-specific pose estimation (e.g., humans vs. animals), leading to
cumbersome model design complexity and high maintenance costs. Moreover, they
often focus on learning the visual modality alone and single-label
classification whilst neglecting other available information sources (e.g.,
class name text) and the concurrent occurrence of multiple actions. To overcome
these limitations, we propose a new approach called 'actor-agnostic multi-modal
multi-label action recognition,' which offers a unified solution for various
types of actors, including humans and animals. We further formulate a novel
Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object
detection framework (e.g., DETR), characterized by leveraging visual and
textual modalities to represent the action classes better. The elimination of
actor-specific model designs is a key advantage, as it removes the need for
actor pose estimation altogether. Extensive experiments on five publicly
available benchmarks show that our MSQNet consistently outperforms prior
actor-specific alternatives on human and animal single- and multi-label action
recognition tasks by up to 50%. Code is made available at
https://github.com/mondalanindya/MSQNet.
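A minimal sketch of the multi-modal query idea described above, assuming a PyTorch-style implementation: class-name text embeddings are fused with pooled video features to form per-class decoder queries, a DETR-like transformer decoder attends over the video tokens, and each query yields one sigmoid logit for multi-label prediction. The module names, dimensions, and the use of nn.TransformerDecoder are illustrative assumptions, not the authors' released code (see the repository linked above for that).

```python
import torch
import torch.nn as nn

class MultiModalQueryDecoder(nn.Module):
    """Sketch of a DETR-style decoder whose queries fuse text and video cues."""

    def __init__(self, num_classes, dim=512, num_layers=4, num_heads=8):
        super().__init__()
        # Stand-in for frozen class-name text embeddings (e.g., from a text encoder).
        self.class_text_embed = nn.Parameter(torch.randn(num_classes, dim))
        self.fuse = nn.Linear(2 * dim, dim)  # fuse text embedding with pooled video feature
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.cls_head = nn.Linear(dim, 1)    # one logit per class query

    def forward(self, video_tokens):
        # video_tokens: (B, N, dim) spatio-temporal tokens from any video backbone
        B = video_tokens.size(0)
        pooled = video_tokens.mean(dim=1, keepdim=True)               # (B, 1, dim)
        text = self.class_text_embed.unsqueeze(0).expand(B, -1, -1)   # (B, C, dim)
        queries = self.fuse(torch.cat([text, pooled.expand_as(text)], dim=-1))
        decoded = self.decoder(tgt=queries, memory=video_tokens)      # (B, C, dim)
        return self.cls_head(decoded).squeeze(-1)                     # (B, C) logits

# Multi-label training would use a per-class sigmoid, e.g.
# loss = nn.BCEWithLogitsLoss()(logits, multi_hot_targets.float())
```

Because the queries are built from class-name text rather than actor-specific pose, the same decoder can serve human and animal videos alike, which is the actor-agnostic property the abstract emphasizes.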
Related papers
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module (a rough sketch of such a cascade appears after this list).
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be seen as a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based
Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - MIST: Multiple Instance Self-Training Framework for Video Anomaly
Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z) - Discovering Multi-Label Actor-Action Association in a Weakly Supervised
Setting [22.86745487695168]
We propose a baseline based on multi-instance and multi-label (MIML) learning.
We further propose a novel approach that uses sets of actions as the representation instead of modeling individual action classes.
On a challenging dataset, the proposed approach outperforms the MIML baseline and is competitive with fully supervised approaches.
arXiv Detail & Related papers (2021-01-21T11:59:47Z)
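The MIST entry above describes replacing dense spatio-temporal self-attention with cascaded segment and region selection before attending over the surviving tokens. The sketch below illustrates one way such a cascade could look; the cosine-similarity scoring, top-k sizes, and module names are assumptions made for illustration and do not reproduce the MIST implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedSelector(nn.Module):
    """Sketch: keep the top-k segments for a question, then the top regions
    inside them, and only then run attention over the reduced token set."""

    def __init__(self, dim=512, top_seg=4, top_reg=8, num_heads=8):
        super().__init__()
        self.top_seg, self.top_reg = top_seg, top_reg
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, region_feats, question):
        # region_feats: (B, S, R, D) features for S segments with R regions each
        # question:     (B, D) pooled question embedding
        B, S, R, D = region_feats.shape
        q = question.unsqueeze(1)                                        # (B, 1, D)

        # 1) Segment selection: score each segment against the question.
        seg_repr = region_feats.mean(dim=2)                              # (B, S, D)
        seg_scores = F.cosine_similarity(seg_repr, q, dim=-1)            # (B, S)
        seg_idx = seg_scores.topk(self.top_seg, dim=1).indices
        sel = torch.gather(region_feats, 1,
                           seg_idx[:, :, None, None].expand(-1, -1, R, D))

        # 2) Region selection inside the kept segments.
        reg_scores = F.cosine_similarity(sel, q[:, None, :, :], dim=-1)  # (B, k_s, R)
        reg_idx = reg_scores.topk(self.top_reg, dim=2).indices
        sel = torch.gather(sel, 2, reg_idx[..., None].expand(-1, -1, -1, D))

        # 3) Attention over the much smaller selected token set.
        tokens = sel.flatten(1, 2)                                       # (B, k_s*k_r, D)
        out, _ = self.attn(q, tokens, tokens)
        return out.squeeze(1)                                            # (B, D)
```

The point of the cascade is cost: attention runs over k_s * k_r tokens instead of S * R, which is what makes long-form video tractable in this family of models.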
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.