CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization
- URL: http://arxiv.org/abs/2505.23524v2
- Date: Wed, 04 Jun 2025 19:18:31 GMT
- Title: CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization
- Authors: Rui Xia, Dan Jiang, Quan Zhang, Ke Zhang, Chun Yuan
- Abstract summary: Unsupervised temporal action localization (UTAL) has gained popularity. Current methods face two main challenges: 1) classification pre-trained features overly focus on highly discriminative regions; 2) relying solely on visual modality information makes it difficult to determine contextual boundaries. We propose a CLIP-assisted cross-view audiovisual-enhanced UTAL method.
- Score: 53.89574102984098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to obtain. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual-enhanced UTAL method. Specifically, we introduce vision-language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.
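The abstract stays at a high level, so the following is only a minimal, illustrative sketch (in PyTorch) of one way to combine snippet-level CLIP/VLP features, classification-pretrained features, and audio features, with a simple cross-view consistency term. All module names, feature dimensions, and the specific fusion and loss are assumptions made for illustration, not the paper's actual architecture.

```python
# Minimal, illustrative sketch only -- NOT the paper's actual architecture.
# Assumes precomputed snippet-level features: CLIP/VLP visual features,
# classification-pretrained visual features, and audio features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewFusion(nn.Module):
    """Projects three feature views into a shared space and scores snippets."""

    def __init__(self, dim_clip=512, dim_cls=2048, dim_audio=128, dim_shared=256):
        super().__init__()
        self.proj_clip = nn.Linear(dim_clip, dim_shared)
        self.proj_cls = nn.Linear(dim_cls, dim_shared)
        self.proj_audio = nn.Linear(dim_audio, dim_shared)
        # Class-agnostic "actionness" head used to propose temporal regions.
        self.actionness = nn.Sequential(
            nn.Linear(3 * dim_shared, dim_shared), nn.ReLU(),
            nn.Linear(dim_shared, 1),
        )

    def forward(self, f_clip, f_cls, f_audio):
        # Each input: (batch, num_snippets, dim_*)
        z_clip = F.normalize(self.proj_clip(f_clip), dim=-1)
        z_cls = F.normalize(self.proj_cls(f_cls), dim=-1)
        z_audio = F.normalize(self.proj_audio(f_audio), dim=-1)
        fused = torch.cat([z_clip, z_cls, z_audio], dim=-1)
        scores = self.actionness(fused).squeeze(-1)  # (batch, num_snippets)
        # A simple cross-view consistency loss: different views of the same
        # snippet should agree (one plausible form of cross-view self-supervision).
        consistency = (1 - F.cosine_similarity(z_clip, z_cls, dim=-1)).mean() \
                    + (1 - F.cosine_similarity(z_clip, z_audio, dim=-1)).mean()
        return scores, consistency


if __name__ == "__main__":
    model = CrossViewFusion()
    f_clip = torch.randn(2, 100, 512)   # e.g. CLIP features per snippet
    f_cls = torch.randn(2, 100, 2048)   # e.g. classification-pretrained features
    f_audio = torch.randn(2, 100, 128)  # e.g. audio features per snippet
    scores, loss = model(f_clip, f_cls, f_audio)
    print(scores.shape, loss.item())
```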
Related papers
- ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP [12.031278034659872]
Continual learning empowers pre-trained vision-language models to adapt effectively to novel or previously underrepresented data distributions. ChordPrompt introduces cross-modal prompts to leverage interactions between visual and textual information. ChordPrompt outperforms state-of-the-art methods in zero-shot generalization and downstream task performance.
arXiv Detail & Related papers (2025-06-24T13:22:06Z) - Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection [11.497620257835964]
We propose CCKT-Det, trained without any extra supervision. The proposed framework constructs a cyclic and dynamic knowledge transfer from language queries and visual region features extracted from vision-language models (VLMs). CCKT-Det consistently improves performance as the scale of the VLMs increases, while adding only moderate overhead to the detector.
arXiv Detail & Related papers (2025-03-14T02:04:28Z) - Open-Vocabulary Temporal Action Localization using Multimodal Guidance [67.09635853019005]
OVTAL enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories.
This flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference.
We introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions.
arXiv Detail & Related papers (2024-06-21T18:00:05Z) - Anomaly Detection by Adapting a pre-trained Vision Language Model [48.225404732089515]
We present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model.
We introduce a learnable prompt and propose to associate it with abnormal patterns through self-supervised learning.
We achieve state-of-the-art results of 97.5/55.6 on MVTec-AD and 89.3/33.1 on VisA for anomaly detection and localization.
arXiv Detail & Related papers (2024-03-14T15:35:07Z) - Visual Self-paced Iterative Learning for Unsupervised Temporal Action Localization [50.48350210022611]
We present a novel self-paced iterative learning model to enhance clustering and localization training simultaneously. We design two (constant-speed and variable-speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of the video pseudo-labels.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z) - Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips [39.29955809641396]
Many backgrounds or noisy frames in a first-person video can distract an action recognition model during its learning process.
Previous works have tried to address this problem with temporal attention but failed to consider the global context of the full video.
We propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across clips (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-12-02T08:02:35Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video using only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task, which only has access to coarse video-level language descriptions without temporal boundary annotations.
We propose a Boundary Adaptive Refinement (BAR) framework that leverages reinforcement learning to guide the process of progressively refining the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z)
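For the Stacked Temporal Attention entry above, here is a minimal sketch of clip-level temporal attention: per-clip features are weighted by attention computed against a global (video-level) context, so discriminative clips are emphasized and background or noisy clips are down-weighted. The module name, dimensions, and mean-pooled context are assumptions for illustration, not the authors' STAM implementation.

```python
# Illustrative sketch of clip-level temporal attention -- assumptions only,
# not the authors' STAM implementation.
import torch
import torch.nn as nn


class ClipTemporalAttention(nn.Module):
    """Weights per-clip features using attention against a global video context."""

    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.query = nn.Linear(dim, hidden)  # global video context -> query
        self.key = nn.Linear(dim, hidden)    # per-clip features -> keys
        self.scale = hidden ** -0.5

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, dim)
        global_ctx = clip_feats.mean(dim=1, keepdim=True)                    # (B, 1, D)
        q = self.query(global_ctx)                                           # (B, 1, H)
        k = self.key(clip_feats)                                             # (B, N, H)
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)   # (B, 1, N)
        # Aggregate clips into a video-level feature, emphasizing clips
        # with high attention weights.
        video_feat = attn @ clip_feats                                       # (B, 1, D)
        return video_feat.squeeze(1), attn.squeeze(1)


if __name__ == "__main__":
    att = ClipTemporalAttention()
    clips = torch.randn(2, 8, 1024)         # 8 clips per video
    video_feat, weights = att(clips)
    print(video_feat.shape, weights.shape)  # (2, 1024) (2, 8)
```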