Text-driven Online Action Detection
- URL: http://arxiv.org/abs/2501.13518v1
- Date: Thu, 23 Jan 2025 10:06:52 GMT
- Title: Text-driven Online Action Detection
- Authors: Manuel Benavent-Lledo, David Mulero-PĂ©rez, David Ortiz-Perez, Jose Garcia-Rodriguez,
- Abstract summary: We introduce TOAD: a Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning.
Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods.
- Score: 0.0
- License:
- Abstract: Detecting actions as they occur is essential for applications like video surveillance, autonomous driving, and human-robot interaction. Known as online action detection, this task requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state-of-the-art, yet the potential of recent advancements in computer vision, particularly vision-language models (VLMs), remains largely untapped for this problem, partly due to high computational costs. In this paper, we introduce TOAD: a Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD leverages CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.
Related papers
- Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO.
This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation caused by distractors.
Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improve the quality of inferred latent actions by x2.7 and efficiency of downstream fine-tuning with a small set of labeled actions, increasing return by x2.6 on average.
arXiv Detail & Related papers (2025-02-13T11:27:05Z) - Text-Enhanced Zero-Shot Action Recognition: A training-free approach [13.074211474150914]
We propose Text-Enhanced Action Recognition (TEAR) for zero-shot video action recognition.
TEAR is training-free and does not require the availability of training data or extensive computational resources.
arXiv Detail & Related papers (2024-08-29T10:20:05Z) - Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [4.736059095502584]
This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition.
We introduce a novel cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are utilised to capture different aspects of action representations.
arXiv Detail & Related papers (2024-04-09T12:09:56Z) - TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive
Learning [15.673602262069531]
Active Speaker Detection (ASD) is a task to determine whether a person is speaking or not in a series of video frames.
We propose TalkNCE, a novel talk-aware contrastive loss.
Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.
arXiv Detail & Related papers (2023-09-21T17:59:11Z) - Look, Remember and Reason: Grounded reasoning in videos with language
models [5.3445140425713245]
Multi-temporal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z) - PIVOT: Prompting for Video Continual Learning [50.80141083993668]
We introduce PIVOT, a novel method that leverages extensive knowledge in pre-trained models from the image domain.
Our experiments show that PIVOT improves state-of-the-art methods by a significant 27% on the 20-task ActivityNet setup.
arXiv Detail & Related papers (2022-12-09T13:22:27Z) - SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z) - ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.