BID: Boundary-Interior Decoding for Unsupervised Temporal Action
Localization Pre-Training
- URL: http://arxiv.org/abs/2403.07354v1
- Date: Tue, 12 Mar 2024 06:23:45 GMT
- Title: BID: Boundary-Interior Decoding for Unsupervised Temporal Action
Localization Pre-Training
- Authors: Qihang Fang and Chengcheng Tang and Shugao Ma and Yanchao Yang
- Abstract summary: We propose the first unsupervised pre-training framework that partitions a skeleton-based motion sequence into semantically meaningful pre-action segments.
By fine-tuning our pre-training network with a small amount of annotated data, we show results outperforming SOTA methods by a large margin.
- Score: 13.273908640951252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skeleton-based motion representations are robust for action localization and
understanding due to their invariance to perspective, lighting, and occlusion,
compared with images. Yet, they are often ambiguous and incomplete when taken
out of context, even for human annotators. As infants discern gestures before
associating them with words, actions can be conceptualized before being
grounded with labels. Therefore, we propose the first unsupervised pre-training
framework, Boundary-Interior Decoding (BID), that partitions a skeleton-based
motion sequence into discovered semantically meaningful pre-action segments. By
fine-tuning our pre-training network with a small amount of annotated data, we
show results outperforming SOTA methods by a large margin.
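The abstract describes partitioning a skeleton-based motion sequence into semantically meaningful segments without supervision. The sketch below is NOT the BID method (which learns boundary and interior decoders); it is a toy velocity-threshold change-point heuristic in pure Python, included only to illustrate what "partitioning a motion sequence into segments" means concretely. All names and the threshold are illustrative assumptions.

```python
# Toy stand-in for unsupervised boundary discovery on a motion sequence.
# Not the paper's learned decoder: a simple change-point heuristic that
# cuts wherever consecutive frame features change abruptly.

def segment_motion(frames, threshold=1.0):
    """Split a list of per-frame feature vectors into (start, end) segments,
    placing a boundary wherever frame-to-frame change exceeds `threshold`."""
    boundaries = [0]
    for t in range(1, len(frames)):
        # L2 distance between consecutive frames as a crude motion cue
        diff = sum((a - b) ** 2 for a, b in zip(frames[t], frames[t - 1])) ** 0.5
        if diff > threshold:
            boundaries.append(t)
    boundaries.append(len(frames))
    # Pair up consecutive boundaries into half-open segments
    return [(boundaries[i], boundaries[i + 1])
            for i in range(len(boundaries) - 1)]

# Example: 6 frames with one abrupt change between frames 2 and 3
frames = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1],
          [3.0, 3.0], [3.1, 3.0], [3.1, 3.1]]
print(segment_motion(frames, threshold=1.0))  # [(0, 3), (3, 6)]
```

In the paper's setting, such discovered segments would then serve as pre-action units for fine-tuning with a small amount of labeled data.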
Related papers
- Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents [39.95793203302782]
Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents.
We propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraints.
arXiv Detail & Related papers (2025-02-03T10:16:49Z)
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- Causal Unsupervised Semantic Segmentation [60.178274138753174]
Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations.
We propose a novel framework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages insights from causal inference.
arXiv Detail & Related papers (2023-10-11T10:54:44Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- ZScribbleSeg: Zen and the Art of Scribble Supervised Medical Image Segmentation [16.188681108101196]
We propose to utilize solely scribble annotations for weakly supervised segmentation.
Existing solutions mainly leverage selective losses computed solely on annotated areas.
We introduce regularization terms to encode the spatial relationship and shape prior.
We integrate the efficient scribble supervision with the prior into a unified framework, denoted as ZScribbleSeg.
arXiv Detail & Related papers (2023-01-12T09:00:40Z)
- SegTAD: Precise Temporal Action Detection via Semantic Segmentation [65.01826091117746]
We formulate the task of temporal action detection in a novel perspective of semantic segmentation.
Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free.
We propose an end-to-end framework, SegTAD, composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
arXiv Detail & Related papers (2022-03-03T06:52:13Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
- Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
arXiv Detail & Related papers (2021-06-30T15:42:08Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
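The SegTAD entry above notes that, because temporal action detection is 1-dimensional, coarse detection annotations convert to fine-grained per-frame segmentation labels "for free". A minimal sketch of that conversion, with illustrative names not taken from the SegTAD code:

```python
# Expand (start, end, class) action intervals into per-frame labels.
# Because the timeline is 1-D, each interval simply labels every frame it
# covers; unclaimed frames fall back to a background class.

def intervals_to_frame_labels(num_frames, intervals, background=0):
    """Convert detection-style annotations to segmentation-style labels.
    `intervals` is a list of (start, end, class) with half-open [start, end)."""
    labels = [background] * num_frames
    for start, end, cls in intervals:
        for t in range(max(start, 0), min(end, num_frames)):
            labels[t] = cls
    return labels

# Two actions (class 5 and class 2) on an 8-frame timeline
print(intervals_to_frame_labels(8, [(1, 3, 5), (5, 7, 2)]))
# [0, 5, 5, 0, 0, 2, 2, 0]
```

The resulting per-frame labels can then supervise a 1-D segmentation network without any extra annotation cost.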
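The "Towards Tokenized Human Dynamics Representation" entry evaluates frame-wise representation learning with Kendall's Tau, a rank-correlation score in [-1, 1]. For reference, a minimal O(n^2) pure-Python computation of the Tau-a variant (in practice `scipy.stats.kendalltau` would be used, which also handles ties):

```python
# Kendall's Tau-a: (concordant pairs - discordant pairs) / total pairs.
# A pair (i, j) is concordant when x and y rank it in the same order.

def kendall_tau(x, y):
    """Rank correlation between two equal-length sequences (no tie correction)."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0  (identical ordering)
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (reversed ordering)
```

In that paper's setting, a Tau near 1 means the learned frame embeddings preserve the temporal ordering of the ground-truth sequence.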
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.