Constraint and Union for Partially-Supervised Temporal Sentence
Grounding
- URL: http://arxiv.org/abs/2302.09850v1
- Date: Mon, 20 Feb 2023 09:14:41 GMT
- Title: Constraint and Union for Partially-Supervised Temporal Sentence
Grounding
- Authors: Chen Ju, Haicheng Wang, Jinxiang Liu, Chaofan Ma, Ya Zhang, Peisen
Zhao, Jianlong Chang, Qi Tian
- Abstract summary: Temporal sentence grounding aims to detect the event timestamps described by a natural language query in given untrimmed videos.
The existing fully-supervised setting achieves great performance but incurs expensive annotation costs.
This paper introduces an intermediate partially-supervised setting, i.e., only short-clip or even single-frame labels are available during training.
- Score: 70.70385299135916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence grounding aims to detect the event timestamps
described by a natural language query in given untrimmed videos. The existing
fully-supervised setting achieves great performance but incurs expensive
annotation costs, while the weakly-supervised setting adopts cheap labels but
performs poorly. To pursue high performance at a lower annotation cost, this
paper introduces an intermediate partially-supervised setting, i.e., only
short-clip or even single-frame labels are available during training. To take
full advantage of partial labels, we propose a novel quadruple constraint
pipeline to comprehensively shape event-query aligned representations, covering
intra- and inter-sample, uni- and multi-modal constraints. The former raises
intra-cluster compactness and inter-cluster separability; the latter enables
event-background separation and event-query gathering. To achieve stronger
performance with explicit grounding optimization, we further introduce a
partial-full union framework, i.e., bridging with an additional
fully-supervised branch, to benefit from its strong grounding supervision
while remaining robust to partial annotations. Extensive experiments and
ablations on Charades-STA and ActivityNet Captions demonstrate the
significance of partial supervision and the superiority of our method.
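As a rough illustration of what such constraint terms can look like in practice, below is a minimal PyTorch sketch of two contrastive-style losses: an intra-sample, cross-modal term (event-background separation, event-query gathering) and an inter-sample, uni-modal term (intra-cluster compactness, inter-cluster separability). All function names, tensor shapes, and the temperature value are hypothetical assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def event_query_gather_loss(event_feat, bg_feat, query_feat, tau=0.07):
    """Intra-sample, cross-modal term (hypothetical form): pull each
    event embedding toward its own query and push the background away."""
    event = F.normalize(event_feat, dim=-1)   # (B, D)
    bg = F.normalize(bg_feat, dim=-1)         # (B, D)
    query = F.normalize(query_feat, dim=-1)   # (B, D)
    pos = (event * query).sum(-1) / tau       # event-query similarity
    neg = (bg * query).sum(-1) / tau          # background-query similarity
    return -torch.log(pos.exp() / (pos.exp() + neg.exp())).mean()

def inter_sample_cluster_loss(event_feat, labels, tau=0.07):
    """Inter-sample, uni-modal term (hypothetical form): events sharing a
    semantic label attract (compactness), different labels repel
    (separability), in the style of supervised contrastive learning."""
    z = F.normalize(event_feat, dim=-1)               # (B, D)
    sim = z @ z.t() / tau                             # (B, B) similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float('-inf'))      # drop self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_cnt = pos_mask.sum(1).clamp(min=1)
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_cnt).mean()
```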
Related papers
- Timestamp-supervised Wearable-based Activity Segmentation and
Recognition with Contrastive Learning and Order-Preserving Optimal Transport [11.837401473598288]
We propose a novel method for joint activity segmentation and recognition with timestamp supervision.
Prototypes are estimated from class-activation maps to form a sample-to-prototype contrast module.
Comprehensive experiments on four public HAR datasets demonstrate that our model trained with timestamp supervision is superior to the state-of-the-art weakly-supervised methods.
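A minimal sketch of a sample-to-prototype contrast of the kind this summary describes, assuming per-frame embeddings, one prototype per class (e.g. estimated from class-activation maps), and timestamp-derived frame labels; names, shapes, and the temperature are hypothetical.

```python
import torch.nn.functional as F

def sample_to_prototype_contrast(frames, prototypes, labels, tau=0.1):
    """Hypothetical sample-to-prototype contrastive loss: each frame
    embedding is attracted to its class prototype and repelled from the
    prototypes of all other classes."""
    f = F.normalize(frames, dim=-1)       # (N, D) frame embeddings
    p = F.normalize(prototypes, dim=-1)   # (C, D) one prototype per class
    logits = f @ p.t() / tau              # (N, C) scaled similarities
    return F.cross_entropy(logits, labels)
```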
arXiv Detail & Related papers (2023-10-13T14:00:49Z)
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves performance competitive with or superior to state-of-the-art methods.
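One plausible reading of "proposals of flexible duration" is a multi-scale candidate generator around each annotated point; the sketch below only assumes that scheme, and all scale and offset values are invented.

```python
def generate_point_proposals(point, num_frames, scales=(8, 16, 32, 64)):
    """Hypothetical multi-scale proposal generator: around one annotated
    point (a frame index), emit candidate segments of several durations,
    letting the point fall at different offsets inside each candidate.
    Scoring and ranking of the proposals is left to the model."""
    proposals = set()
    for s in scales:
        for shift in (-s // 2, -s // 4, 0):
            start = max(0, point + shift)
            end = min(num_frames, start + s)
            if end - start > 1:
                proposals.add((start, end))
    return sorted(proposals)
```

For example, generate_point_proposals(120, 300) yields windows of length 8 to 64 that all overlap frame 120.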
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines the streams' frame predictions and finally combines them.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels to counter the lack of supervision for unannotated videos.
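A minimal sketch of the multi-stream distillation idea as summarized: several teacher streams' frame predictions are averaged into one soft target and distilled into a student. The temperature, reduction, and all names are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def multi_stream_distill_loss(student_logits, teacher_logits_list, T=2.0):
    """Hypothetical distillation step: average the frame predictions of
    several teacher streams into one soft target, then distill it into
    the student with a temperature-scaled KL divergence."""
    with torch.no_grad():
        probs = [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
        target = torch.stack(probs).mean(0)           # (frames, classes)
    log_p = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p, target, reduction='batchmean') * (T * T)
```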
arXiv Detail & Related papers (2022-11-02T17:34:04Z)
- A Generalized & Robust Framework For Timestamp Supervision in Temporal Action Segmentation [79.436224998992]
In temporal action segmentation, timestamp supervision requires only a handful of labelled frames per video sequence.
We propose a novel Expectation-Maximization based approach that leverages the label uncertainty of unlabelled frames.
Our proposed method produces state-of-the-art results and even exceeds the fully-supervised setup on several metrics and datasets.
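An E-step of such an approach could, for instance, fuse the model's per-frame posterior with a distance-based prior around each annotated timestamp. The sketch below is a guess at that shape; the Gaussian prior, the sigma value, and all names are assumptions rather than the paper's method.

```python
import torch

def e_step_soft_labels(frame_logits, timestamps, ts_labels, sigma=50.0):
    """Hypothetical E-step: fuse the model's per-frame class posterior with
    a Gaussian prior centred on each annotated timestamp, so that label
    uncertainty grows with distance from the annotated frames."""
    T, C = frame_logits.shape
    post = frame_logits.softmax(dim=-1)                 # model belief (T, C)
    t = torch.arange(T, dtype=torch.float32)
    prior = torch.full((T, C), 1e-4)                    # small uniform floor
    for ts, c in zip(timestamps, ts_labels):            # one bump per stamp
        prior[:, c] += torch.exp(-(t - ts) ** 2 / (2 * sigma ** 2))
    prior = prior / prior.sum(dim=-1, keepdim=True)
    q = post * prior                                    # fused belief
    return q / q.sum(dim=-1, keepdim=True)              # soft labels (T, C)
```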
arXiv Detail & Related papers (2022-07-20T18:30:48Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
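The temporal ordering constraint lends itself to a simple hinge penalty; a minimal sketch, assuming predicted segment centers normalized to [0, 1] in paragraph order and an invented margin:

```python
import torch.nn.functional as F

def temporal_order_loss(pred_centers, margin=0.05):
    """Hypothetical ordering penalty: sentence i in a paragraph should be
    grounded before sentence i+1, so consecutive predicted segment centers
    (normalized to [0, 1]) must increase by at least a small margin."""
    gaps = pred_centers[1:] - pred_centers[:-1]   # consecutive center gaps
    return F.relu(margin - gaps).mean()           # hinge on violations
```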
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
- WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection [75.80075054706079]
We propose a weakly- and semi-supervised object detection framework (WSSOD).
An agent detector is first trained on a joint dataset and then used to predict pseudo bounding boxes on weakly-annotated images.
The proposed framework demonstrates remarkable performance on the PASCAL-VOC and MSCOCO benchmarks, achieving performance comparable to that of fully-supervised settings.
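The pseudo-labelling step could be as simple as confidence filtering gated by the image-level labels; a hypothetical sketch (the actual WSSOD pipeline likely filters more carefully):

```python
def make_pseudo_boxes(detections, image_labels, score_thresh=0.5):
    """Hypothetical pseudo-labelling step: keep an agent detector's
    prediction on a weakly-annotated image only if it is confident and its
    class is consistent with the image-level labels."""
    return [(box, cls) for box, cls, score in detections
            if score >= score_thresh and cls in image_labels]
```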
arXiv Detail & Related papers (2021-05-21T11:58:50Z)
- Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task, which only has access to coarse video-level language descriptions without temporal boundary annotations.
We propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
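One way such a refinement loop can be structured is a discrete action space over boundary moves; the sketch below assumes that design and is not the paper's actual action set or reward.

```python
def refine_boundary(start, end, action, step, num_frames):
    """Hypothetical refinement move for an RL agent: each discrete action
    shifts one boundary or rescales the window; a reward (not shown) would
    score how well the refined segment matches the language query."""
    moves = {0: (-step, 0), 1: (step, 0),         # move left boundary
             2: (0, -step), 3: (0, step),         # move right boundary
             4: (-step, step), 5: (step, -step)}  # expand / shrink window
    ds, de = moves[action]
    start = min(max(0, start + ds), num_frames - 2)
    end = max(min(num_frames - 1, end + de), start + 1)
    return start, end
```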
arXiv Detail & Related papers (2020-09-18T03:32:47Z)
- Weakly Supervised Temporal Action Localization with Segment-Level Labels [140.68096218667162]
Temporal action localization presents a trade-off between test performance and annotation-time cost.
We introduce a new segment-level supervision setting: segments are labeled when annotators observe actions happening within them.
We devise a partial segment loss, regarded as a form of loss sampling, to learn integral action parts from labeled segments.
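Read as loss sampling, a partial segment loss can be sketched as cross-entropy applied only to frames inside the labeled segments; the shapes and names below are assumptions.

```python
import torch
import torch.nn.functional as F

def partial_segment_loss(frame_logits, segments):
    """Hypothetical partial segment loss: apply cross-entropy only on the
    frames inside labeled segments (a form of loss sampling), leaving the
    unlabeled remainder of the video unconstrained."""
    losses = []
    for start, end, cls in segments:        # labeled (start, end, class)
        logits = frame_logits[start:end]    # (L, C) frames in the segment
        target = torch.full((end - start,), cls, dtype=torch.long)
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean()       # assumes >= 1 labeled segment
```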
arXiv Detail & Related papers (2020-07-03T10:32:19Z)