Contrast-Unity for Partially-Supervised Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2502.12917v1
- Date: Tue, 18 Feb 2025 14:59:18 GMT
- Title: Contrast-Unity for Partially-Supervised Temporal Sentence Grounding
- Authors: Haicheng Wang, Chen Ju, Weixiong Lin, Chaofan Ma, Shuai Xiao, Ya Zhang, Yanfeng Wang,
- Abstract summary: This paper introduces an intermediate partially-supervised setting, i.e., only short-clip is available during training.
To make full use of partial labels, we specially design one contrast-unity framework, with the two-stage goal of implicit-explicit progressive grounding.
In the explicit stage, to explicitly optimize grounding objectives, we train one fully-supervised model using obtained pseudo-labels for grounding refinement and denoising.
- Score: 31.12361511030561
- License:
- Abstract: Temporal sentence grounding aims to detect event timestamps described by the natural language query from given untrimmed videos. The existing fully-supervised setting achieves great results but requires expensive annotation costs; while the weakly-supervised setting adopts cheap labels but performs poorly. To pursue high performance with less annotation costs, this paper introduces an intermediate partially-supervised setting, i.e., only short-clip is available during training. To make full use of partial labels, we specially design one contrast-unity framework, with the two-stage goal of implicit-explicit progressive grounding. In the implicit stage, we align event-query representations at fine granularity using comprehensive quadruple contrastive learning: event-query gather, event-background separation, intra-cluster compactness and inter-cluster separability. Then, high-quality representations bring acceptable grounding pseudo-labels. In the explicit stage, to explicitly optimize grounding objectives, we train one fully-supervised model using obtained pseudo-labels for grounding refinement and denoising. Extensive experiments and thoroughly ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision, as well as our superior performance.
Related papers
- Dynamic Contrastive Learning for Time Series Representation [6.086030037869592]
We propose DynaCL, an unsupervised contrastive representation learning framework for time series.
We demonstrate that DynaCL embeds instances from time series into semantically meaningful clusters.
Our findings also reveal that high scores on unsupervised clustering metrics do not guarantee that the representations are useful in downstream tasks.
arXiv Detail & Related papers (2024-10-20T15:20:24Z) - Constraint and Union for Partially-Supervised Temporal Sentence
Grounding [70.70385299135916]
temporal sentence grounding aims to detect the event timestamps described by the natural language query from given untrimmed videos.
The existing fully-supervised setting achieves great performance but requires expensive annotation costs.
This paper introduces an intermediate partially-supervised setting, i.e., only short-clip or even single-frame labels are available during training.
arXiv Detail & Related papers (2023-02-20T09:14:41Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully-supervised setting that requires a large amount of manual annotations or in a weakly supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z) - Semi-supervised Semantic Segmentation with Directional Context-aware
Consistency [66.49995436833667]
We focus on the semi-supervised segmentation problem where only a small set of labeled data is provided with a much larger collection of totally unlabeled images.
A preferred high-level representation should capture the contextual information while not losing self-awareness.
We present the Directional Contrastive Loss (DC Loss) to accomplish the consistency in a pixel-to-pixel manner.
arXiv Detail & Related papers (2021-06-27T03:42:40Z) - Reinforcement Learning for Weakly Supervised Temporal Grounding of
Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task that merely accesses to coarse video-level language description annotation without temporal boundary.
We propose a emphBoundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z) - Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
arXiv Detail & Related papers (2020-04-30T17:09:17Z) - Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of
Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.