Related papers: CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos

URL: http://arxiv.org/abs/2303.09044v4
Date: Wed, 28 Feb 2024 13:53:28 GMT
Title: CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos
Authors: Soufiane Belharbi, Shakeeb Murtaza, Marco Pedersoli, Ismail Ben Ayed, Luke McCaffrey, Eric Granger
Abstract summary: The Co-Localization-CAM (CoLo-CAM) method exploits spatiotemporal information in activation maps during training without constraining an object's position. Co-localization improves localization performance because the joint learning creates direct communication among pixels across all image locations.
Score: 23.447026400051772
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods only rely on visual and motion cues, while discarding discriminative information, making them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although their results are promising, objects are assumed to have limited movement from frame to frame, leading to degradation in performance for relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on Co-Localization, hence the name CoLo-CAM. Given a sequence of frames, localization is jointly learned based on color cues extracted across the corresponding maps, by assuming that an object has similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, achieving co-localization. This improves localization performance because the joint learning creates direct communication among pixels across all image locations and over all frames, allowing for transfer, aggregation, and correction of localizations. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our CoLo-CAM method, and its robustness to long-term dependencies, leading to new state-of-the-art performance for the WSVOL task.
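
The color-based co-localization idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a binary foreground/background CAM, a Gaussian color kernel with a hypothetical bandwidth `sigma`, and a relaxed Potts penalty as the color term. The function name `color_crf_loss` is illustrative, and the dense pairwise kernel only scales to tiny inputs.

```python
import numpy as np


def color_crf_loss(frames, cams, sigma=0.1):
    """Color-only pairwise CRF penalty over all pixels of all frames.

    frames: float array (T, H, W, 3), RGB values in [0, 1].
    cams:   float array (T, H, W), foreground probability in [0, 1].
    sigma:  color bandwidth (hypothetical default, not from the paper).
    """
    colors = frames.reshape(-1, 3)            # (N, 3), N = T * H * W
    fg = cams.reshape(-1)                     # (N,)
    probs = np.stack([fg, 1.0 - fg], axis=1)  # (N, 2): foreground, background

    # Gaussian color affinity between every pair of pixels, within and across frames.
    diff = colors[:, None, :] - colors[None, :, :]
    affinity = np.exp(-(diff ** 2).sum(axis=-1) / (2.0 * sigma ** 2))

    # Relaxed Potts penalty: similar colors with different activations cost more.
    loss = 0.0
    for k in range(probs.shape[1]):
        loss += probs[:, k] @ affinity @ (1.0 - probs[:, k])
    return loss / colors.shape[0] ** 2


# Toy sequence: a red "object" that moves from the top-left to the bottom-right pixel.
red, blue = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
frames = np.array([[[red, blue], [blue, blue]],
                   [[blue, blue], [blue, red]]])
# CAMs that follow the red pixel in both frames.
consistent = np.array([[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]])
# CAMs that stay at the top-left position and miss the object in frame 2.
inconsistent = np.array([[[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]])
print(color_crf_loss(frames, consistent) < color_crf_loss(frames, inconsistent))
```

Because the kernel here depends only on color and not on pixel position, the penalty does not care where the object sits in each frame, which is the property the abstract describes as not constraining an object's position.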

Related papers

PixelCAM: Pixel Class Activation Mapping for Histology Image Classification and ROI Localization [7.869923456842283]
Weakly supervised object localization (WSOL) methods allow training models to classify images and localize ROIs. Standard WSOL methods rely on class activation mapping (CAM) methods to produce spatial localization maps according to a single- or two-step strategy. We propose PixelCAM, a cost-effective foreground/background pixel-wise classifier in the pixel-feature space that allows for spatial object localization.
arXiv Detail & Related papers (2025-03-31T14:18:01Z)
Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks. Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos [12.762698438702854]
State-of-the-art WSVOL methods rely on class activation mapping (CAM). Our TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions. During inference, the model can process individual frames for real-time localization applications.
arXiv Detail & Related papers (2024-07-08T15:08:41Z)
Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify and localize temporal boundaries of actions in a video. We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions. Experimental results show that our approach outperforms state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild [72.0226493284814]
We propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc. Our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both.
arXiv Detail & Related papers (2023-01-12T18:19:00Z)
Attention-based Class Activation Diffusion for Weakly-Supervised Semantic Segmentation [98.306533433627]
Extracting class activation maps (CAM) is a key step for weakly-supervised semantic segmentation (WSSS). This paper proposes a new method to couple CAM and Attention matrix in a probabilistic Diffusion way, and dubs it AD-CAM. Experiments show that AD-CAM as pseudo labels can yield stronger WSSS models than the state-of-the-art variants of CAM.
arXiv Detail & Related papers (2022-11-20T10:06:32Z)
TCAM: Temporal Class Activation Maps for Object Localization in Weakly-Labeled Unconstrained Videos [22.271760669551817]
Weakly supervised video object localization (WSVOL) allows locating objects in videos using only global video tags such as the object class. In this paper, we leverage the successful class activation mapping (CAM) methods, designed for WSOL based on still images. A new Temporal CAM (TCAM) method is introduced to train a discriminant deep learning (DL) model to exploit spatiotemporal information in videos.
arXiv Detail & Related papers (2022-08-30T21:20:34Z)
CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping [18.67907876709536]
Class RE-Activation Mapping (CREAM) is a clustering-based approach to boost the activation values of the integral object regions. CREAM achieves the state-of-the-art performance on CUB, ILSVRC and OpenImages benchmark datasets.
arXiv Detail & Related papers (2022-05-27T11:57:41Z)
Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance. We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset. Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning [74.03651142051656]
We develop LIIR, a locality-aware inter- and intra-video reconstruction framework. We exploit cross-video affinities as extra negative samples within a unified, inter- and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z)
F-CAM: Full Resolution CAM via Guided Parametric Upscaling [20.609010268320013]
Class Activation Mapping (CAM) methods have recently gained much attention for weakly-supervised object localization (WSOL) tasks. CAM methods are typically integrated within off-the-shelf CNN backbones, such as ResNet50. We introduce a generic method for parametric upscaling of CAMs that allows constructing accurate full resolution CAMs.
arXiv Detail & Related papers (2021-09-15T04:45:20Z)
TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [112.46381729542658]
Weakly supervised object localization (WSOL) is a challenging problem when given image category labels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction.
arXiv Detail & Related papers (2021-03-27T09:43:16Z)
Inter-Image Communication for Weakly Supervised Localization [77.2171924626778]
Weakly supervised localization aims at finding target object regions using only image-level supervision. We propose to leverage pixel-level similarities across different objects for learning more accurate object locations. Our method achieves a Top-1 localization error rate of 45.17% on the ILSVRC validation set.
arXiv Detail & Related papers (2020-08-12T04:14:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.