Related papers: Test-Time Zero-Shot Temporal Action Localization

URL: http://arxiv.org/abs/2404.05426v2
Date: Thu, 11 Apr 2024 07:12:35 GMT
Title: Test-Time Zero-Shot Temporal Action Localization
Authors: Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci
Abstract summary: ZS-TAL seeks to identify and locate actions in untrimmed videos unseen during training. Training-based ZS-TAL approaches assume the availability of labeled data for supervised learning. We introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL).
Score: 58.84919541314969
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
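The first two steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes frame and class-name embeddings already live in a shared space (here hand-written 2-D vectors stand in for VLM features), replaces the self-supervised adaptation with a plain similarity threshold, and omits the caption-based refinement entirely. The function names `video_pseudo_label` and `propose_segments`, and the threshold value, are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def video_pseudo_label(frame_feats, class_feats):
    """Mean-pool all frame features and return the index of the closest class.

    Assumption: mean pooling is one simple way to 'aggregate information
    from the entire video'; the paper may aggregate differently.
    """
    dim = len(frame_feats[0])
    pooled = [sum(f[d] for f in frame_feats) / len(frame_feats) for d in range(dim)]
    scores = [cosine(pooled, c) for c in class_feats]
    return max(range(len(scores)), key=scores.__getitem__)


def propose_segments(frame_feats, class_feat, threshold):
    """Group consecutive frames whose similarity to the class exceeds threshold.

    Assumption: a fixed threshold stands in for the adapted localization step.
    Returns inclusive (start_frame, end_frame) index pairs.
    """
    segments, start = [], None
    for i, f in enumerate(frame_feats):
        active = cosine(f, class_feat) > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(frame_feats) - 1))
    return segments


# Toy data: two class embeddings and six frame embeddings (all made up).
class_feats = [[1.0, 0.0], [0.0, 1.0]]
frame_feats = [[0.1, 0.2], [0.9, 0.1], [1.0, 0.0], [0.8, 0.3], [0.1, 0.3], [0.9, 0.2]]

label = video_pseudo_label(frame_feats, class_feats)
segments = propose_segments(frame_feats, class_feats[label], threshold=0.8)
print(label, segments)  # 0 [(1, 3), (5, 5)]
```

In a real pipeline the embeddings would come from a pre-trained vision and language model, and the per-frame scores would typically be smoothed before grouping; both are left out to keep the sketch self-contained.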

Related papers

Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models [15.17499718666202]
We propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method. We leverage existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos. Our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime.
arXiv Detail & Related papers (2025-01-23T16:13:58Z)
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark [20.15425745473231]
Localizing unusual activities, such as human errors or surveillance incidents, in videos holds practical significance. To explore foundation models' capability in localizing unusual activity, we introduce UAL-Bench. UAL-Bench features three video datasets: UAG-OOPS, UAG-SSBD, UAG-FunQA, and an instruction-tune dataset: OOPS-UAG-Instruct. Our results show the VLM-LLM approach excels in localizing short-span unusual events and predicting their onset (start time) more accurately than Vid-LLMs.
arXiv Detail & Related papers (2024-10-02T02:33:09Z)
Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization [3.996503381756227]
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. We propose a novel framework that aligns human action knowledge and semantic knowledge in a probabilistic embedding space. Our method significantly outperforms all previous state-of-the-art methods.
arXiv Detail & Related papers (2024-08-12T07:09:12Z)
Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task. OV-STAD requires training a model on a limited set of base classes with box and label supervision. To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
STAT: Towards Generalizable Temporal Action Localization [56.634561073746056]
Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels. Existing methods suffer from severe performance degradation when transferring to different distributions. We propose GTAL, which focuses on improving the generalizability of action localization methods.
arXiv Detail & Related papers (2024-04-20T07:56:21Z)
Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Training-based methods are prone to be domain-specific, thus being costly for practical deployment. We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition [72.35438297011176]
We propose a novel method to realize seamless adaptation of pre-trained models for visual place recognition (VPR). Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method. Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time.
arXiv Detail & Related papers (2024-02-22T12:55:01Z)
Active Learning with Effective Scoring Functions for Semi-Supervised Temporal Action Localization [15.031156121516211]
This paper focuses on a rarely investigated yet practical task named semi-supervised TAL. We propose an effective active learning method, named AL-STAL. Experiment results show that AL-STAL outperforms the existing competitors and achieves satisfying performance compared with fully-supervised learning.
arXiv Detail & Related papers (2022-08-31T13:39:38Z)
Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Our model significantly outperforms state-of-the-art alternatives. Our model also yields superior results on supervised TAD over recent strong competitors.
arXiv Detail & Related papers (2022-07-17T13:59:46Z)
Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263]
Action anticipation aims to recognize the action with a partial observation. We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification. We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.
arXiv Detail & Related papers (2019-06-15T10:30:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.