Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
- URL: http://arxiv.org/abs/2411.10922v1
- Date: Sun, 17 Nov 2024 00:39:59 GMT
- Title: Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
- Authors: Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, Yu Kong
- Abstract summary: Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos.
Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories.
We propose OpenMixer, a novel method that exploits the inherent semantics and localizability of large vision-language models.
- Abstract: Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting, where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world, where test videos inevitably fall outside the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem: detecting any action in test videos while training a model on a fixed set of action categories. To achieve such open-vocabulary capability, we propose OpenMixer, a novel method that exploits the inherent semantics and localizability of large vision-language models (VLMs) within the family of query-based detection transformers (DETR). Specifically, OpenMixer consists of spatial and temporal OpenMixer blocks (S-OMB and T-OMB) and a dynamically fused alignment (DFA) module. The three components collectively combine the strong generalization of pre-trained VLMs with the end-to-end learning of the DETR design. Moreover, we establish OVAD benchmarks under various settings, and the experimental results show that OpenMixer outperforms all baselines in detecting both seen and unseen actions. We release the code, models, and dataset splits at https://github.com/Cogito2012/OpenMixer.
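The open-vocabulary capability described above rests on a step common to VLM-based detectors: detected query (region) features are scored against text embeddings of arbitrary action names, so an unseen category needs only a text prompt at test time. Below is a minimal numpy sketch of this idea, assuming features and text embeddings have already been extracted; the function names and the temperature value are illustrative, not OpenMixer's actual API.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def open_vocab_classify(query_feats, text_embeds, temperature=0.01):
    """Score each detected query against arbitrary action-name embeddings.

    query_feats: (Q, D) region/query features from the detector.
    text_embeds: (C, D) VLM text embeddings, one per action name;
                 C may include categories never seen during training.
    Returns (Q, C) softmax probabilities over the open vocabulary.
    """
    sims = l2norm(query_feats) @ l2norm(text_embeds).T  # cosine similarity
    logits = sims / temperature
    logits -= logits.max(axis=-1, keepdims=True)        # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy usage: 2 detected queries scored against 3 action names,
# one of which could be an "unseen" category added only at test time.
rng = np.random.default_rng(0)
query_feats = rng.normal(size=(2, 8))
text_embeds = rng.normal(size=(3, 8))
probs = open_vocab_classify(query_feats, text_embeds)
```

Because classification reduces to similarity against text embeddings, extending the vocabulary is just a matter of appending rows to `text_embeds`; no detector retraining is required.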
Related papers
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
- OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation [54.98688607911399]
We propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into Vision-Language Models (VLMs).
Existing VLM adaptation methods improve performance on base (training) queries, but fail to preserve the open-set capabilities of VLMs on novel queries.
Our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes.
arXiv Detail & Related papers (2024-05-30T15:16:06Z)
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- SOAR: Scene-debiasing Open-set Action Recognition [81.8198917049666]
We propose Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module.
The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning.
The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information.
arXiv Detail & Related papers (2023-09-03T20:20:48Z)
- STMixer: A One-Stage Sparse Action Detector [48.0614066856134]
We propose a new one-stage action detector, termed STMixer.
We present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative video features.
We obtain state-of-the-art results on the AVA, UCF101-24, and JHMDB datasets.
arXiv Detail & Related papers (2023-03-28T10:47:06Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
- Evidential Deep Learning for Open Set Action Recognition [36.350348194248014]
We formulate the action recognition problem from the evidential deep learning (EDL) perspective.
We propose a plug-and-play module to debias the learned representation through contrastive learning.
arXiv Detail & Related papers (2021-07-21T15:45:37Z)
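The evidential formulation above replaces a plain softmax with a Dirichlet distribution over class probabilities, so the model can flag unknown actions via an explicit uncertainty score. A minimal sketch of the standard EDL recipe (evidence from a non-negative activation, Dirichlet parameters alpha = evidence + 1, uncertainty u = K / sum(alpha)); this illustrates the common formulation rather than this paper's exact implementation.

```python
import numpy as np

def edl_predict(logits):
    """Dirichlet-based prediction with an explicit uncertainty score.

    logits: (N, K) raw network outputs for K known action classes.
    Returns (probs, uncertainty): expected class probabilities and
    u = K / sum(alpha), which approaches 1 when evidence is near zero
    (e.g. an unknown, open-set action) and shrinks with strong evidence.
    """
    evidence = np.maximum(logits, 0.0)        # non-negative evidence (ReLU)
    alpha = evidence + 1.0                    # Dirichlet concentration
    strength = alpha.sum(axis=-1, keepdims=True)
    probs = alpha / strength                  # expected class probability
    num_classes = logits.shape[-1]
    uncertainty = num_classes / strength.squeeze(-1)
    return probs, uncertainty

# Toy usage: one confident sample vs. one zero-evidence (unknown) sample.
probs, uncertainty = edl_predict(
    np.array([[10.0, 0.0, 0.0],    # strong evidence for class 0
              [0.0, 0.0, 0.0]])    # no evidence at all
)
```

Thresholding `uncertainty` then gives a simple open-set reject rule: samples whose uncertainty exceeds the threshold are treated as unknown actions rather than forced into a known class.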
This list is automatically generated from the titles and abstracts of the papers in this site.