Revisiting Foreground and Background Separation in Weakly-supervised
Temporal Action Localization: A Clustering-based Approach
- URL: http://arxiv.org/abs/2312.14138v1
- Date: Thu, 21 Dec 2023 18:57:12 GMT
- Title: Revisiting Foreground and Background Separation in Weakly-supervised
Temporal Action Localization: A Clustering-based Approach
- Authors: Qinying Liu, Zilei Wang, Shenghai Rong, Junjie Li, Yixin Zhang
- Abstract summary: Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels.
We propose a novel clustering-based F&B separation algorithm.
We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3.
- Score: 48.684550829098534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly-supervised temporal action localization aims to localize action
instances in videos with only video-level action labels. Existing methods
mainly embrace a localization-by-classification pipeline that optimizes the
snippet-level prediction with a video classification loss. However, this
formulation suffers from the discrepancy between classification and detection,
resulting in inaccurate separation of foreground and background (F&B)
snippets. To alleviate this problem, we propose to explore the underlying
structure among the snippets by resorting to unsupervised snippet clustering,
rather than heavily relying on the video classification loss. Specifically, we
propose a novel clustering-based F&B separation algorithm. It comprises two
core components: a snippet clustering component that groups the snippets into
multiple latent clusters and a cluster classification component that further
classifies the cluster as foreground or background. As there are no
ground-truth labels to train these two components, we introduce a unified
self-labeling mechanism based on optimal transport to produce high-quality
pseudo-labels that match several plausible prior distributions. This ensures
that the cluster assignments of the snippets can be accurately associated with
their F&B labels, thereby boosting the F&B separation. We evaluate our method
on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves
promising performance on all three benchmarks while being significantly more
lightweight than previous methods. Code is available at
https://github.com/Qinying-Liu/CASE
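To make the abstract's two components more concrete, below is a minimal PyTorch sketch of the general idea: Sinkhorn-style optimal-transport self-labeling turns snippet-to-cluster scores into pseudo-labels that match a prior distribution over clusters, and a small cluster-classification head maps clusters to foreground or background. This is an illustrative sketch only, not the authors' CASE implementation (see the linked repository for that); the uniform cluster prior, tensor shapes, and names such as `sinkhorn_pseudo_labels` are assumptions.
```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sinkhorn_pseudo_labels(scores, cluster_prior, epsilon=0.05, iters=3):
    """Turn snippet-to-cluster scores (T x K) into soft pseudo-labels whose
    cluster marginals match `cluster_prior` (a K-vector summing to 1)."""
    Q = torch.exp(scores / epsilon)          # T x K transport kernel
    Q = Q / Q.sum()                          # normalize into a joint distribution
    num_snippets, _ = Q.shape
    snippet_marginal = torch.full((num_snippets,), 1.0 / num_snippets)
    for _ in range(iters):                   # alternate column/row scaling
        Q = Q * (cluster_prior / Q.sum(dim=0)).unsqueeze(0)
        Q = Q * (snippet_marginal / Q.sum(dim=1)).unsqueeze(1)
    return Q / Q.sum(dim=1, keepdim=True)    # one soft assignment per snippet


# Toy usage: T snippets with D-dim features, grouped into K latent clusters.
T, D, K = 100, 256, 8
snippet_feats = F.normalize(torch.randn(T, D), dim=1)
cluster_centers = F.normalize(torch.randn(K, D), dim=1)   # learnable in practice

# Snippet clustering component: cosine similarity of snippets to cluster centers.
scores = snippet_feats @ cluster_centers.t()               # T x K
pseudo = sinkhorn_pseudo_labels(scores, torch.full((K,), 1.0 / K))

# Self-labeling loss: the clustering head is trained against its own
# balanced pseudo-labels (soft targets; requires PyTorch >= 1.10).
cluster_loss = F.cross_entropy(scores / 0.1, pseudo)

# Cluster classification component (assumed linear head): label each cluster
# as foreground vs. background, then obtain snippet-level foreground scores
# by marginalizing over the soft cluster assignments.
fb_head = torch.nn.Linear(D, 1)
snippet_fg_prob = pseudo @ torch.sigmoid(fb_head(cluster_centers))  # T x 1
print(cluster_loss.item(), snippet_fg_prob.shape)
```
The abstract mentions matching several plausible prior distributions; for brevity, the sketch enforces only a single uniform cluster-size prior.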
Related papers
- Densify Your Labels: Unsupervised Clustering with Bipartite Matching for
Weakly Supervised Point Cloud Segmentation [42.144991202299934]
We propose a weakly supervised semantic segmentation method for point clouds that predicts "per-point" labels from just "whole-scene" annotations.
Our core idea is to propagate the scene-level labels to each point in the point cloud by creating pseudo labels in a conservative way.
We evaluate our method on the ScanNet and S3DIS datasets, outperforming the state of the art, and demonstrate that we can achieve results comparable to fully supervised methods.
arXiv Detail & Related papers (2023-12-11T19:18:17Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- PDiscoNet: Semantically consistent part discovery for fine-grained recognition [62.12602920807109]
We propose PDiscoNet to discover object parts using only image-level class labels, along with priors that encourage the parts to be discriminative, compact, and distinct from each other.
Our results on CUB, CelebA, and PartImageNet show that the proposed method provides substantially better part discovery performance than previous methods.
arXiv Detail & Related papers (2023-09-06T17:19:29Z)
- Contrastive Bootstrapping for Label Refinement [34.55195008779178]
We propose a lightweight contrastive clustering-based bootstrapping method to iteratively refine the labels of passages.
Experiments on NYT and 20News show that our method outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-06-07T15:49:04Z)
- Weakly-supervised Action Localization via Hierarchical Mining [76.00021423700497]
Weakly-supervised action localization aims to temporally localize and classify action instances in videos using only video-level categorical labels.
We propose a hierarchical mining strategy at the video and snippet levels, i.e., hierarchical supervision and hierarchical consistency mining.
We show that HiM-Net outperforms existing methods by large margins on the THUMOS14 and ActivityNet 1.3 datasets by hierarchically mining supervision and consistency.
arXiv Detail & Related papers (2022-06-22T12:19:09Z)
- Exploring Category-correlated Feature for Few-shot Image Classification [27.13708881431794]
We present a simple yet effective feature rectification method that explores the category correlation between novel and base classes as prior knowledge.
The proposed approach consistently obtains considerable performance gains on three widely used benchmarks.
arXiv Detail & Related papers (2021-12-14T08:25:24Z)
- Channel DropBlock: An Improved Regularization Method for Fine-Grained Visual Classification [58.07257910065007]
Existing approaches mainly tackle this problem by introducing attention mechanisms to locate discriminative parts, or feature-encoding approaches to extract highly parameterized features, in a weakly-supervised fashion.
In this work, we propose a lightweight yet effective regularization method named Channel DropBlock (CDB), in combination with two alternative correlation metrics, to address this problem.
arXiv Detail & Related papers (2021-06-07T09:03:02Z)
- Predictive K-means with local models [0.028675177318965035]
Predictive clustering seeks to obtain the best of both worlds by combining clustering with prediction.
We present two new algorithms using this technique and show on a variety of data sets that they are competitive for prediction performance.
arXiv Detail & Related papers (2020-12-16T10:49:36Z)
- Fine-Grained Visual Classification with Efficient End-to-end Localization [49.9887676289364]
We present an efficient localization module that can be fused with a classification network in an end-to-end setup.
We evaluate the new model on the three benchmark datasets CUB200-2011, Stanford Cars and FGVC-Aircraft.
arXiv Detail & Related papers (2020-05-11T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.