Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action
Detection
- URL: http://arxiv.org/abs/2304.04688v4
- Date: Wed, 20 Sep 2023 14:14:16 GMT
- Title: Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action
Detection
- Authors: Wei-Jhe Huang, Jheng-Hsien Yeh, Min-Hung Chen, Gueter Josmy Faure,
Shang-Hong Lai
- Abstract summary: The goal of spatio-temporal action detection is to determine the time and place where each person's action occurs in a video.
Most of the existing methods adopt fully-supervised learning, which requires a large amount of training data.
We propose to utilize a pre-trained visual-language model to extract the representative image and text features.
- Score: 12.109835641702784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of spatio-temporal action detection is to determine the time and
place where each person's action occurs in a video and classify the
corresponding action category. Most of the existing methods adopt
fully-supervised learning, which requires a large amount of training data,
making it very difficult to achieve zero-shot learning. In this paper, we
propose to utilize a pre-trained visual-language model to extract the
representative image and text features, and model the relationship between
these features through different interaction modules to obtain the interaction
feature. In addition, we use this feature to prompt each label to obtain more
appropriate text features. Finally, we calculate the similarity between the
interaction feature and the text feature for each label to determine the action
category. Our experiments on J-HMDB and UCF101-24 datasets demonstrate that the
proposed interaction module and prompting make the visual-language features
better aligned, thus achieving excellent accuracy for zero-shot spatio-temporal
action detection. The code will be available at
https://github.com/webber2933/iCLIP.
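To make the pipeline concrete, here is a minimal sketch of the similarity-based classification step, assuming CLIP-like person and frame features and one text embedding per action label. The interaction module here is a plain cross-attention stand-in, the label-prompting step is omitted, and all names and dimensions are hypothetical, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionModule(nn.Module):
    """Cross-attention stand-in: person features attend to frame context."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, person: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # person: (B, P, D) detected-person features; context: (B, N, D) frame tokens
        out, _ = self.attn(person, context, context)
        return self.norm(person + out)  # residual interaction feature

def zero_shot_logits(interaction: torch.Tensor,
                     label_text: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Cosine similarity between interaction features and per-label text features."""
    v = F.normalize(interaction, dim=-1)   # (B, P, D)
    t = F.normalize(label_text, dim=-1)    # (C, D) one embedding per action label
    return v @ t.T / temperature           # (B, P, C) class logits per person

# Usage with random stand-ins: 2 clips, 3 persons, 196 frame tokens, 21 labels
person, context = torch.randn(2, 3, 512), torch.randn(2, 196, 512)
label_text = torch.randn(21, 512)
logits = zero_shot_logits(InteractionModule()(person, context), label_text)
print(logits.shape)  # torch.Size([2, 3, 21])
```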
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
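As a rough illustration of an Interest-Token-Spotting-style selection, the sketch below scores frame tokens against a per-person query and keeps the top-k; the function name, the cosine scoring, and the top-k rule are all assumptions, not the paper's exact mechanism:

```python
import torch

def spot_interest_tokens(person_query: torch.Tensor,
                         frame_tokens: torch.Tensor,
                         k: int = 16) -> torch.Tensor:
    """Select the k frame tokens most relevant to one person.

    person_query: (D,)  embedding of a detected person
    frame_tokens: (N, D) patch tokens from a visual encoder
    Returns the (k, D) tokens with the highest cosine similarity.
    """
    q = person_query / person_query.norm()
    t = frame_tokens / frame_tokens.norm(dim=-1, keepdim=True)
    scores = t @ q                    # (N,) cosine similarity per token
    idx = scores.topk(k).indices      # indices of the k best-matching tokens
    return frame_tokens[idx]

# Usage with random stand-in features (196 patch tokens of width 512)
tokens = spot_interest_tokens(torch.randn(512), torch.randn(196, 512))
print(tokens.shape)  # torch.Size([16, 512])
```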
- Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z)
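For intuition about decoupled prompt learning, the sketch below keeps separate learnable prompt tokens for the vision and language streams and prepends them to each token sequence; the module name, prompt counts, and dimensions are assumptions, not CMMP's actual design:

```python
import torch
import torch.nn as nn

class DecoupledPrompts(nn.Module):
    """Learnable vision and language prompts kept as separate parameters."""
    def __init__(self, dim: int = 512, n_vis: int = 8, n_txt: int = 8):
        super().__init__()
        self.vis_prompts = nn.Parameter(torch.randn(n_vis, dim) * 0.02)
        self.txt_prompts = nn.Parameter(torch.randn(n_txt, dim) * 0.02)

    @staticmethod
    def _prepend(prompts: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) -> (B, n_prompts + N, D)
        batch = tokens.size(0)
        return torch.cat([prompts.unsqueeze(0).expand(batch, -1, -1), tokens], dim=1)

    def forward(self, vis_tokens, txt_tokens):
        # Each modality gets its own prompts before entering its (frozen) encoder.
        return (self._prepend(self.vis_prompts, vis_tokens),
                self._prepend(self.txt_prompts, txt_tokens))

# Usage: batch of 2, 196 visual tokens and 77 text tokens of width 512
dp = DecoupledPrompts()
v, t = dp(torch.randn(2, 196, 512), torch.randn(2, 77, 512))
print(v.shape, t.shape)  # torch.Size([2, 204, 512]) torch.Size([2, 85, 512])
```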
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of a visual-language model to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms the previous methods in various zero-shot and full-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
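The summary does not spell out the mechanism, but additive (Bahdanau-style) attention used as self-attention looks roughly like the sketch below; treat it as a generic illustration rather than KI2HOI's exact formulation:

```python
import torch
import torch.nn as nn

class AdditiveSelfAttention(nn.Module):
    """Bahdanau-style additive attention applied over one token sequence."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.wq = nn.Linear(dim, hidden, bias=False)
        self.wk = nn.Linear(dim, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) token features
        q = self.wq(x).unsqueeze(2)                     # (B, N, 1, H)
        k = self.wk(x).unsqueeze(1)                     # (B, 1, N, H)
        scores = self.v(torch.tanh(q + k)).squeeze(-1)  # (B, N, N) additive scores
        return scores.softmax(dim=-1) @ x               # weighted sum of tokens

# Usage: 2 images, 10 instance tokens of width 512
out = AdditiveSelfAttention(512)(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```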
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection [10.012716326383567]
Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos.
We present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification.
We enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters.
arXiv Detail & Related papers (2023-11-01T00:17:37Z)
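A lightweight adapter of the kind commonly inserted into a frozen CLIP encoder is a small residual bottleneck; the sizes below are assumptions, and this is a generic sketch rather than ZEETAD's exact module:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: the only trainable part of a frozen encoder."""
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual keeps frozen features

# Freeze a stand-in backbone layer and train only the adapter parameters
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter()
out = adapter(backbone(torch.randn(2, 196, 512)))
print(out.shape)  # torch.Size([2, 196, 512])
```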
- Learning Action-Effect Dynamics from Pairs of Scene-graphs [50.72283841720014]
We propose a novel method that leverages scene-graph representations of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
arXiv Detail & Related papers (2022-12-07T03:36:37Z)
- End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple end-to-end approach that effectively utilizes the unlabeled data.
Video action detection requires both action class prediction and spatio-temporal consistency.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
arXiv Detail & Related papers (2022-03-08T18:11:25Z)
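One standard way to use unlabeled video, which this summary hints at, is consistency regularization between two augmented views; the sketch below is a generic FixMatch-style illustration under that assumption, not this paper's exact loss:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_weak: torch.Tensor,
                     logits_strong: torch.Tensor,
                     threshold: float = 0.9) -> torch.Tensor:
    """Pseudo-label the weakly augmented view, supervise the strong view.

    logits_*: (B, C) class predictions for two augmentations of the same clips.
    Only confident pseudo-labels (max prob >= threshold) contribute to the loss.
    """
    probs = logits_weak.detach().softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    mask = conf >= threshold
    if not mask.any():
        return logits_strong.sum() * 0.0  # keep the graph, contribute nothing
    return F.cross_entropy(logits_strong[mask], pseudo[mask])

# Usage with random stand-in predictions for 8 unlabeled clips, 24 classes
loss = consistency_loss(torch.randn(8, 24), torch.randn(8, 24))
```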
- Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection [81.32280287658486]
We propose a novel one-stage method, namely the Glance and Gaze Network (GGNet).
GGNet adaptively models a set of action-aware points (ActPoints) via glance and gaze steps.
We design an action-aware approach that effectively matches each detected interaction with its associated human-object pair.
arXiv Detail & Related papers (2021-04-12T08:01:04Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
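To make the graph-reasoning idea concrete, here is a minimal message-passing layer over human/object node features with a learned soft adjacency; the node construction and update rule are assumptions, not in-Graph's published equations:

```python
import torch
import torch.nn as nn

class GraphReasoningLayer(nn.Module):
    """One round of message passing over human/object nodes."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.edge = nn.Linear(2 * dim, 1)   # scores a directed edge per node pair
        self.update = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, D) features of detected humans and objects in one image
        n = nodes.size(0)
        src = nodes.unsqueeze(1).expand(n, n, -1)         # (N, N, D)
        dst = nodes.unsqueeze(0).expand(n, n, -1)         # (N, N, D)
        adj = self.edge(torch.cat([src, dst], dim=-1))    # (N, N, 1) edge logits
        adj = adj.squeeze(-1).softmax(dim=-1)             # soft adjacency per node
        messages = adj @ nodes                            # aggregate neighbors
        return torch.relu(nodes + self.update(messages))  # residual node update

# Usage: 5 nodes (e.g., 2 humans + 3 objects) with 256-d features
refined = GraphReasoningLayer()(torch.randn(5, 256))
print(refined.shape)  # torch.Size([5, 256])
```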
- Spatio-Temporal Action Detection with Multi-Object Interaction [127.85524354900494]
In this paper, we study the spatio-temporal action detection problem with multi-object interaction.
We introduce a new dataset that is spatially annotated with action tubes containing multi-object interactions.
We propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously.
arXiv Detail & Related papers (2020-04-01T00:54:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.