Related papers: Convex Combination Consistency between Neighbors for Weakly-supervised Action Localization

URL: http://arxiv.org/abs/2205.00400v3
Date: Fri, 3 May 2024 15:17:12 GMT
Title: Convex Combination Consistency between Neighbors for Weakly-supervised Action Localization
Authors: Qinying Liu, Zilei Wang, Ruoxi Chen, Zhilin Li
Abstract summary: We propose a novel WTAL approach named Convex Combination Consistency between Neighbors (C$^3$BN). C$^3$BN consists of two key ingredients: a micro data augmentation strategy that increases the diversity in-between adjacent snippets, and a macro-micro consistency regularization. Experimental results demonstrate the effectiveness of C$^3$BN on top of various baselines for WTAL with video-level and point-level supervisions.
Score: 26.63463867095924
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Weakly-supervised temporal action localization (WTAL) intends to detect action instances with only weak supervision, e.g., video-level labels. The current de facto pipeline locates action instances by thresholding and grouping continuous high-score regions on temporal class activation sequences. In this route, the capacity of the model to recognize the relationships between adjacent snippets is of vital importance which determines the quality of the action boundaries. However, it is error-prone since the variations between adjacent snippets are typically subtle, and unfortunately this is overlooked in the literature. To tackle the issue, we propose a novel WTAL approach named Convex Combination Consistency between Neighbors (C$^3$BN). C$^3$BN consists of two key ingredients: a micro data augmentation strategy that increases the diversity in-between adjacent snippets by convex combination of adjacent snippets, and a macro-micro consistency regularization that enforces the model to be invariant to the transformations w.r.t. video semantics, snippet predictions, and snippet representations. Consequently, fine-grained patterns in-between adjacent snippets are enforced to be explored, thereby resulting in a more robust action boundary localization. Experimental results demonstrate the effectiveness of C$^3$BN on top of various baselines for WTAL with video-level and point-level supervisions. Code is at https://github.com/Qinying-Liu/C3BN.
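
Illustrative sketch (not the authors' code): the abstract describes mixing adjacent snippets by convex combination and asking the model to stay consistent under that mixing. The NumPy example below shows one way the snippet-prediction part of this idea could look. The function names, the mean-squared-error penalty, the random mixing weights, and the toy linear and softmax models are all assumptions made for illustration; the official implementation is in the repository linked above and may differ substantially.

```python
import numpy as np


def mix_adjacent(values, lam):
    """Convex combination of each snippet with its right-hand neighbor.

    values: (T, D) array, one row per snippet.
    lam:    (T-1,) mixing weights in [0, 1] (hypothetical sampling scheme).
    Returns a (T-1, D) array of synthetic in-between snippets.
    """
    lam = lam[:, None]  # broadcast the weight over the feature axis
    return lam * values[:-1] + (1.0 - lam) * values[1:]


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def consistency_loss(model, features, lam):
    """Gap between predictions on mixed snippets and mixed predictions.

    Assumed form: mean squared error between model(mix(x)) and
    mix(model(x)), using the same weights lam on both sides.
    """
    target = mix_adjacent(model(features), lam)       # (T-1, C)
    mixed_preds = model(mix_adjacent(features, lam))  # (T-1, C)
    return float(np.mean((mixed_preds - target) ** 2))


# Toy data: T snippets with D-dim features and C classes (all hypothetical).
rng = np.random.default_rng(0)
T, D, C = 8, 16, 4
features = rng.normal(size=(T, D))
lam = rng.uniform(size=T - 1)
W = rng.normal(size=(D, C))


def linear_model(x):
    """Purely linear classifier: commutes with convex combinations."""
    return x @ W


def nonlinear_model(x):
    """Softmax classifier: does not commute, so the penalty is non-zero."""
    return softmax(x @ W)


loss_linear = consistency_loss(linear_model, features, lam)
loss_nonlinear = consistency_loss(nonlinear_model, features, lam)
```

A purely linear model satisfies this constraint automatically (its loss is zero up to floating-point error), while a non-linear model generally does not, which is why such a penalty can act as a regularizer on how the model behaves between neighboring snippets.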

Related papers

Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
arXiv Detail & Related papers (2026-02-27T12:04:07Z)
CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization [25.78477436147408]
Coupling Implicit and Explicit Cues (CIEC) aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs. It integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module. It focuses on meaningful content words and leverages relative visual bias to assist token localization.
arXiv Detail & Related papers (2026-02-02T14:46:38Z)
Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception [17.654858416126093]
Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations. We present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens.
arXiv Detail & Related papers (2025-08-27T07:27:42Z)
BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation [4.977568882858193]
We propose a novel bidirectional conditioning factorization in a semantic-aligned space for Scene Graph Generation (SGG). We introduce an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR). BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling mutual enhancement between these predictions. Second, Random Feature Alignment (RFA) is presented to regularize feature space by distilling multi-modal knowledge from pre-trained models. Within this regularized feature space, BCG is feasible to capture
arXiv Detail & Related papers (2024-07-26T13:02:48Z)
Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching [0.0]
We propose a new technique, based on graph Laplacian eigenmaps, to match point clouds by taking into account fine local structures. To deal with the order and sign ambiguity of Laplacian eigenmaps, we introduce a new operator, called Coupled Laplacian. We show that the similarity between those aligned high-dimensional spaces provides a locally meaningful score to match shapes.
arXiv Detail & Related papers (2024-02-27T10:10:12Z)
Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach [48.684550829098534]
Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. We propose a novel clustering-based F&B separation algorithm. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3.
arXiv Detail & Related papers (2023-12-21T18:57:12Z)
Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. We propose a one-stage framework named TriDet to resolve imprecise predictions of action boundaries by existing methods. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
Weakly-supervised Action Localization via Hierarchical Mining [76.00021423700497]
Weakly-supervised action localization aims to localize and classify action instances in the given videos temporally with only video-level categorical labels. We propose a hierarchical mining strategy under video-level and snippet-level manners, i.e., hierarchical supervision and hierarchical consistency mining. We show that HiM-Net outperforms existing methods on THUMOS14 and ActivityNet1.3 datasets with large margins by hierarchically mining the supervision and consistency.
arXiv Detail & Related papers (2022-06-22T12:19:09Z)
Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization. Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting. Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
Scope Head for Accurate Localization in Object Detection [135.9979405835606]
We propose a novel detector coined as ScopeNet, which models anchors of each location as a mutually dependent relationship. With our concise and effective design, the proposed ScopeNet achieves state-of-the-art results on COCO.
arXiv Detail & Related papers (2020-05-11T04:00:09Z)
Patch-level Neighborhood Interpolation: A General and Effective Graph-based Regularization Strategy [77.34280933613226]
We propose a general regularizer called Patch-level Neighborhood Interpolation (Pani) that conducts a non-local representation in the computation of networks. Our proposal explicitly constructs patch-level graphs in different layers and then linearly interpolates neighborhood patch features, serving as a general and effective regularization strategy.
arXiv Detail & Related papers (2019-11-21T06:31:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.