CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos
- URL: http://arxiv.org/abs/2303.09044v5
- Date: Sat, 25 Jan 2025 22:38:56 GMT
- Title: CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos
- Authors: Soufiane Belharbi, Shakeeb Murtaza, Marco Pedersoli, Ismail Ben Ayed, Luke McCaffrey, Eric Granger
- Abstract summary: Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks.
This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position.
Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs.
- Score: 22.12785608546199
- Abstract: Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods rely only on visual and motion cues, while discarding discriminative information, making them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although their results are promising, objects are assumed to have limited movement from frame to frame, leading to degradation in performance for relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on co-localization, hence the name CoLo-CAM. Given a sequence of frames, localization is jointly learned based on color cues extracted across the corresponding maps, by assuming that an object has similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, achieving co-localization. This improves localization performance because the joint learning creates direct communication among pixels across all image locations and over all frames, allowing for transfer, aggregation, and correction of localizations. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our method, and its robustness to long-term dependencies, leading to new state-of-the-art performance on the WSVOL task.
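The color-based constraint described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a common relaxed form of a CRF pairwise term, in which every pair of pixels pooled from all frames is weighted by a Gaussian kernel on color distance and penalized when their foreground activations disagree. The function names, the `sigma` bandwidth, and the normalization are hypothetical choices made for illustration.

```python
import math


def color_affinity(c1, c2, sigma=0.1):
    """Gaussian kernel on squared RGB distance; colors are (r, g, b) in [0, 1]."""
    d2 = sum((a - b) ** 2 for a, b in zip(c1, c2))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def crf_color_loss(frames, cams, sigma=0.1):
    """Illustrative color-only pairwise term over a sequence of frames/CAMs.

    frames: list of frames, each a flat list of (r, g, b) pixels.
    cams:   list of maps, each a flat list of foreground probabilities in [0, 1].
    Assumption: pixels of all frames are pooled, so pairs may span frames.
    """
    colors = [c for frame in frames for c in frame]
    fg = [p for cam in cams for p in cam]
    if len(colors) != len(fg):
        raise ValueError("frames and cams must contain the same number of pixels")
    n = len(colors)
    if n < 2:
        return 0.0
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = color_affinity(colors[i], colors[j], sigma)
            # Disagreement: i foreground and j background, or the reverse.
            loss += w * (fg[i] * (1.0 - fg[j]) + (1.0 - fg[i]) * fg[j])
    # Average over ordered pairs (a hypothetical normalization).
    return loss / (n * (n - 1))
```

Under these assumptions, activations that agree on similarly colored pixels across frames yield a lower value than activations that flip between frames. The explicit double loop is quadratic in the number of pixels and is only meant to make the pairwise structure visible.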
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos [12.762698438702854]
State-of-the-art WSVOL methods rely on class activation mapping (CAM).
Our TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions.
During inference, the model can process individual frames for real-time localization applications.
arXiv Detail & Related papers (2024-07-08T15:08:41Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify and localize temporal boundaries of actions in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild [72.0226493284814]
We propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc.
Our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both.
arXiv Detail & Related papers (2023-01-12T18:19:00Z)
- TCAM: Temporal Class Activation Maps for Object Localization in Weakly-Labeled Unconstrained Videos [22.271760669551817]
Weakly supervised video object localization (WSVOL) allows locating objects in videos using only global video tags such as object class.
In this paper, we leverage the successful class activation mapping (CAM) methods, designed for WSOL based on still images.
A new Temporal CAM (TCAM) method is introduced to train a discriminant deep learning (DL) model to exploit spatiotemporal information in videos.
arXiv Detail & Related papers (2022-08-30T21:20:34Z)
- CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping [18.67907876709536]
Class RE-Activation Mapping (CREAM) is a clustering-based approach to boost the activation values of the integral object regions.
CREAM achieves state-of-the-art performance on the CUB, ILSVRC and OpenImages benchmark datasets.
arXiv Detail & Related papers (2022-05-27T11:57:41Z)
- Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning [74.03651142051656]
We develop LIIR, a locality-aware inter- and intra-video reconstruction framework.
We exploit cross-video affinities as extra negative samples within a unified, inter- and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z)
- Keep CALM and Improve Visual Feature Attribution [42.784665606132]
The class activation mapping, or CAM, has been the cornerstone of feature attribution methods for multiple vision tasks.
We improve CAM by explicitly incorporating a latent variable encoding the location of the cue for recognition in the formulation.
The resulting model, class activation latent mapping, or CALM, is trained with the expectation-maximization algorithm.
arXiv Detail & Related papers (2021-06-15T03:33:25Z)
- TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [112.46381729542658]
Weakly supervised object localization (WSOL) is a challenging problem when given image category labels.
We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction.
arXiv Detail & Related papers (2021-03-27T09:43:16Z)
- Inter-Image Communication for Weakly Supervised Localization [77.2171924626778]
Weakly supervised localization aims at finding target object regions using only image-level supervision.
We propose to leverage pixel-level similarities across different objects for learning more accurate object locations.
Our method achieves a Top-1 localization error rate of 45.17% on the ILSVRC validation set.
arXiv Detail & Related papers (2020-08-12T04:14:11Z)