A Generalized & Robust Framework For Timestamp Supervision in Temporal
Action Segmentation
- URL: http://arxiv.org/abs/2207.10137v1
- Date: Wed, 20 Jul 2022 18:30:48 GMT
- Title: A Generalized & Robust Framework For Timestamp Supervision in Temporal
Action Segmentation
- Authors: Rahul Rahaman, Dipika Singhania, Alexandre Thiery and Angela Yao
- Abstract summary: In temporal action segmentation, Timestamp supervision requires only a handful of labelled frames per video sequence.
We propose a novel Expectation-Maximization based approach that leverages the label uncertainty of unlabelled frames.
Our proposed method produces SOTA results and even exceeds the fully-supervised setup in several metrics and datasets.
- Score: 79.436224998992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In temporal action segmentation, Timestamp supervision requires only a
handful of labelled frames per video sequence. For unlabelled frames, previous
works rely on assigning hard labels, and performance rapidly collapses under
subtle violations of the annotation assumptions. We propose a novel
Expectation-Maximization (EM) based approach that leverages the label
uncertainty of unlabelled frames and is robust enough to accommodate possible
annotation errors. With accurate timestamp annotations, our proposed method
produces SOTA results and even exceeds the fully-supervised setup on several
metrics and datasets. When applied to timestamp annotations with missing action
segments, our method maintains stable performance. To further test our
formulation's robustness, we introduce a new, challenging annotation setup:
Skip-tag supervision. This setup relaxes the annotation constraints and requires
labels for any fixed number of random frames in a video, making it more flexible
than Timestamp supervision while remaining competitive.
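
The abstract describes the EM formulation only at a high level. As a rough
illustration of the general idea, the hypothetical PyTorch sketch below
alternates an E-step, which turns the model's current predictions into soft
frame-wise targets while clamping the annotated timestamp frames to their
labels, with an M-step, which fits the frame classifier to those soft targets.
The linear model, function names, and toy data are all illustrative
assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def e_step(logits, ts_idx, ts_labels, num_classes):
    """E-step: estimate soft label targets for every frame.

    Unlabelled frames keep the model's current posterior; annotated
    (timestamp) frames are clamped to their one-hot labels. This is a
    simplified stand-in for the paper's estimator, which handles label
    uncertainty and annotation errors in a more principled way.
    """
    with torch.no_grad():
        q = F.softmax(logits, dim=-1)            # (T, C) frame posteriors
        q[ts_idx] = F.one_hot(ts_labels, num_classes).float()
    return q

def m_step(model, features, q, optimizer):
    """M-step: fit the model to the soft targets via soft cross-entropy."""
    logits = model(features)                     # (T, C)
    loss = -(q * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a linear frame classifier over T frames of D-dim features.
T, D, C = 500, 64, 10
model = torch.nn.Linear(D, C)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
features = torch.randn(T, D)
ts_idx = torch.tensor([40, 180, 350])            # annotated frame indices
ts_labels = torch.tensor([2, 5, 7])              # their action labels

for _ in range(10):                              # alternate E and M steps
    q = e_step(model(features), ts_idx, ts_labels, C)
    m_step(model, features, q, optimizer)
```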
Related papers
- Constraint and Union for Partially-Supervised Temporal Sentence
Grounding [70.70385299135916]
Temporal sentence grounding aims to detect the timestamps of events described by a natural language query in given untrimmed videos.
The existing fully-supervised setting achieves great performance but requires expensive annotation costs.
This paper introduces an intermediate partially-supervised setting, i.e., only short-clip or even single-frame labels are available during training.
arXiv Detail & Related papers (2023-02-20T09:14:41Z)
- Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels to counter the lack of supervision for unannotated videos.
arXiv Detail & Related papers (2022-11-02T17:34:04Z)
- Robust Action Segmentation from Timestamp Supervision [18.671808549019833]
Action segmentation is the task of predicting an action label for each frame of an untrimmed video.
Timestamp supervision is a promising type of weak supervision as obtaining one timestamp per action is less expensive than annotating all frames.
We show that our approach is more robust to missing annotations compared to other approaches and various baselines.
arXiv Detail & Related papers (2022-10-12T18:01:14Z)
- Unified Fully and Timestamp Supervised Temporal Action Segmentation via
Sequence to Sequence Translation [15.296933526770967]
This paper introduces a unified framework for video action segmentation via sequence to sequence (seq2seq) translation.
Our proposed method involves a series of modifications and auxiliary loss functions on the standard Transformer seq2seq translation model.
Our framework performs consistently on both fully and timestamp supervised settings, outperforming or competing state-of-the-art on several datasets.
arXiv Detail & Related papers (2022-09-01T17:46:02Z)
- Video Activity Localisation with Uncertainties in Temporal Boundary [74.7263952414899]
Methods for video activity localisation over time implicitly assume that activity temporal boundaries are determined and precise.
In unscripted natural videos, different activities transition into one another smoothly, so it is intrinsically ambiguous to label precisely when an activity starts and ends.
We introduce Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries.
arXiv Detail & Related papers (2022-06-26T16:45:56Z)
- Video Moment Retrieval from Text Queries via Single Frame Annotation [65.92224946075693]
Video moment retrieval aims at finding the start and end timestamps of a moment described by a given natural language query.
Fully supervised methods need complete temporal boundary annotations to achieve promising results.
We propose a new paradigm called "glance annotation".
arXiv Detail & Related papers (2022-04-20T11:59:17Z)
- Temporal Action Segmentation from Timestamp Supervision [25.49797678477498]
We introduce timestamp supervision for the temporal action segmentation task.
Timestamps require an annotation effort comparable to that of weakly supervised approaches.
Our approach uses the model output and the annotated timestamps to generate frame-wise labels (see the sketch after this list).
arXiv Detail & Related papers (2021-03-11T13:52:41Z)
- Weakly Supervised Temporal Action Localization with Segment-Level Labels [140.68096218667162]
Temporal action localization presents a trade-off between test performance and annotation-time cost.
We introduce a new segment-level supervision setting: annotators label a segment whenever they observe an action happening within it.
We devise a partial segment loss, regarded as a form of loss sampling, to learn integral action parts from labeled segments.
arXiv Detail & Related papers (2020-07-03T10:32:19Z)
- SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)
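
A note on the entry "Temporal Action Segmentation from Timestamp Supervision"
above: it generates frame-wise labels from the model output and the annotated
timestamps. As a minimal stand-in, the hypothetical sketch below densifies
sparse timestamps by giving every frame the label of its nearest annotated
frame; this nearest-timestamp rule is an assumption for illustration, not the
summarized paper's procedure, which additionally uses the model output to
place boundaries between consecutive timestamps.

```python
import numpy as np

def frames_from_timestamps(num_frames, ts_idx, ts_labels):
    """Expand sparse timestamp annotations into dense frame-wise labels
    by assigning every frame the action of its nearest timestamp.
    (Illustrative only; not the summarized paper's boundary estimation.)
    """
    ts_idx = np.asarray(ts_idx)
    ts_labels = np.asarray(ts_labels)
    frames = np.arange(num_frames)
    # Distance from every frame to every annotated timestamp: shape (T, K).
    dist = np.abs(frames[:, None] - ts_idx[None, :])
    return ts_labels[dist.argmin(axis=1)]        # label of closest timestamp

# Example: 20 frames, timestamps at 3, 10, 16 with actions 0, 1, 2.
print(frames_from_timestamps(20, [3, 10, 16], [0, 1, 2]))
# -> [0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
```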