TCAM: Temporal Class Activation Maps for Object Localization in
Weakly-Labeled Unconstrained Videos
- URL: http://arxiv.org/abs/2208.14542v1
- Date: Tue, 30 Aug 2022 21:20:34 GMT
- Title: TCAM: Temporal Class Activation Maps for Object Localization in
Weakly-Labeled Unconstrained Videos
- Authors: Soufiane Belharbi, Ismail Ben Ayed, Luke McCaffrey, Eric Granger
- Abstract summary: Weakly supervised video object localization (WSVOL) allows locating objects in videos using only global video tags such as the object class.
In this paper, we leverage the successful class activation mapping (CAM) methods, designed for WSOL based on still images.
A new Temporal CAM (TCAM) method is introduced to train a discriminant deep learning (DL) model that exploits temporal information in videos.
- Score: 22.271760669551817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised video object localization (WSVOL) allows locating
objects in videos using only global video tags such as the object class.
State-of-the-art methods
rely on multiple independent stages, where initial spatio-temporal proposals
are generated using visual and motion cues, then prominent objects are
identified and refined. Localization is done by solving an optimization problem
over one or more videos, and video tags are typically used for video
clustering. This requires a model per video or per class, making inference
costly. Moreover, localized regions are not necessarily discriminant because of
unsupervised motion methods like optical flow, or because video tags are
discarded from the optimization. In this paper, we leverage the successful class
activation mapping (CAM) methods, designed for WSOL based on still images. A
new Temporal CAM (TCAM) method is introduced to train a discriminant deep
learning (DL) model to exploit spatio-temporal information in videos, using an
aggregation mechanism, called CAM-Temporal Max Pooling (CAM-TMP), over
consecutive CAMs. In particular, activations of regions of interest (ROIs) are
collected from CAMs produced by a pretrained CNN classifier to build pixel-wise
pseudo-labels for training the DL model. In addition, a global unsupervised
size constraint and local constraints such as a CRF are used to yield more
accurate CAMs. Inference over single independent frames allows parallel
processing of a clip of frames, and real-time localization. Extensive
experiments on two challenging YouTube-Objects datasets for unconstrained
videos indicate that CAM methods (trained on independent frames) can yield
decent localization accuracy. Our proposed TCAM method achieves new
state-of-the-art WSVOL accuracy, and visual results suggest that it can be
adapted for subsequent tasks like visual object tracking and detection. Code is
publicly available.
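To make the CAM-TMP aggregation concrete, the following is a minimal Python/PyTorch sketch (not the authors' released code) of how CAMs from consecutive frames could be max-pooled over time and thresholded into pixel-wise pseudo-labels. The function names and threshold values are illustrative assumptions.

    # Minimal sketch of CAM-Temporal Max Pooling (CAM-TMP) as described in the
    # abstract: aggregate CAMs over a clip and derive pixel-wise pseudo-labels.
    # Names and thresholds are assumptions, not the reference implementation.
    import torch

    def cam_temporal_max_pooling(cams: torch.Tensor) -> torch.Tensor:
        """Element-wise max over a clip of consecutive CAMs.

        cams: (T, H, W) activation maps from a pretrained CNN classifier.
        returns: (H, W) aggregated activation map.
        """
        return cams.max(dim=0).values

    def pseudo_labels_from_cam(cam: torch.Tensor,
                               fg_thresh: float = 0.7,
                               bg_thresh: float = 0.2) -> torch.Tensor:
        """Build pixel-wise pseudo-labels from an aggregated CAM:
        1 = foreground ROI, 0 = background, 255 = ignored (uncertain).
        Threshold values are assumed for illustration."""
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        labels = torch.full_like(cam, 255, dtype=torch.long)
        labels[cam >= fg_thresh] = 1
        labels[cam <= bg_thresh] = 0
        return labels

    # Example: 8 consecutive 14x14 CAMs from one clip
    clip_cams = torch.rand(8, 14, 14)
    aggregated = cam_temporal_max_pooling(clip_cams)
    labels = pseudo_labels_from_cam(aggregated)

The resulting pseudo-labels would then supervise the localization model at the pixel level, with the size and CRF constraints mentioned in the abstract applied as additional loss terms.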
Related papers
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos [12.762698438702854]
State-of-the-art WSVOL methods rely on class activation mapping (CAM).
Our TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions.
During inference, the model can process individual frames for real-time localization applications.
arXiv Detail & Related papers (2024-07-08T15:08:41Z) - Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods are prone to be domain-specific, thus being costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z) - UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three different localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - CoLo-CAM: Class Activation Mapping for Object Co-Localization in
Weakly-Labeled Unconstrained Videos [23.447026400051772]
The Co-Localization CAM (CoLo-CAM) method exploits temporal information in activation maps during training without constraining an object's position.
Co-Localization improves localization performance because the joint learning creates direct communication among pixels across all image locations.
arXiv Detail & Related papers (2023-03-16T02:29:53Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - F-CAM: Full Resolution CAM via Guided Parametric Upscaling [20.609010268320013]
Class Activation Mapping (CAM) methods have recently gained much attention for weakly-supervised object localization (WSOL) tasks.
CAM methods are typically integrated within off-the-shelf CNN backbones, such as ResNet50.
We introduce a generic method for parametric upscaling of CAMs that allows constructing accurate full resolution CAMs.
arXiv Detail & Related papers (2021-09-15T04:45:20Z)