Unified Fully and Timestamp Supervised Temporal Action Segmentation via
Sequence to Sequence Translation
- URL: http://arxiv.org/abs/2209.00638v1
- Date: Thu, 1 Sep 2022 17:46:02 GMT
- Title: Unified Fully and Timestamp Supervised Temporal Action Segmentation via
Sequence to Sequence Translation
- Authors: Nadine Behrmann, S. Alireza Golestaneh, Zico Kolter, Juergen Gall,
Mehdi Noroozi
- Abstract summary: This paper introduces a unified framework for video action segmentation via sequence to sequence (seq2seq) translation.
Our proposed method involves a series of modifications and auxiliary loss functions on the standard Transformer seq2seq translation model.
Our framework performs consistently in both fully and timestamp supervised settings, outperforming or competing with state-of-the-art methods on several datasets.
- Score: 15.296933526770967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a unified framework for video action segmentation via
sequence to sequence (seq2seq) translation in a fully and timestamp supervised
setup. In contrast to current state-of-the-art frame-level prediction methods,
we view action segmentation as a seq2seq translation task, i.e., mapping a
sequence of video frames to a sequence of action segments. Our proposed method
involves a series of modifications and auxiliary loss functions on the standard
Transformer seq2seq translation model to cope with long input sequences, as
opposed to short output sequences, and with relatively few videos. We incorporate an
auxiliary supervision signal for the encoder via a frame-wise loss and propose
a separate alignment decoder for an implicit duration prediction. Finally, we
extend our framework to the timestamp supervised setting via our proposed
constrained k-medoids algorithm to generate pseudo-segmentations. Our proposed
framework performs consistently in both fully and timestamp supervised
settings, outperforming or competing with state-of-the-art methods on several datasets.
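The paper itself does not include code here; the following is a minimal, hypothetical Python sketch of how a constrained k-medoids step could turn single annotated timestamps into a contiguous pseudo-segmentation for the timestamp supervised setting. The function name, arguments, and the exact cost and update rules are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def constrained_kmedoids_pseudo_segments(features, timestamps, labels, n_iter=10):
    """Hypothetical sketch: derive a contiguous pseudo-segmentation from timestamps.

    features:   (T, D) array of frame-wise features.
    timestamps: sorted list of K annotated frame indices, one per action instance.
    labels:     list of K action labels, one per annotated timestamp.
    Returns a length-T list of pseudo frame labels.
    """
    T = features.shape[0]
    K = len(timestamps)
    medoids = list(timestamps)  # initialise each cluster medoid at its annotated frame
    boundaries = [0] + list(timestamps[1:]) + [T]

    for _ in range(n_iter):
        # Assignment step: choose one cut between each pair of neighbouring
        # timestamps, so that segments stay contiguous, ordered, and each
        # segment keeps its annotated frame (the constraint).
        boundaries = [0]
        for i in range(K - 1):
            candidates = range(timestamps[i] + 1, timestamps[i + 1] + 1)
            costs = []
            for b in candidates:
                left = np.linalg.norm(
                    features[boundaries[i]:b] - features[medoids[i]], axis=1).sum()
                right = np.linalg.norm(
                    features[b:timestamps[i + 1] + 1] - features[medoids[i + 1]], axis=1).sum()
                costs.append(left + right)
            boundaries.append(timestamps[i] + 1 + int(np.argmin(costs)))
        boundaries.append(T)

        # Update step: move each medoid to the frame of its segment with the
        # smallest total distance to the other frames of that segment.
        new_medoids = []
        for i in range(K):
            seg = features[boundaries[i]:boundaries[i + 1]]
            dists = np.linalg.norm(seg[:, None, :] - seg[None, :, :], axis=-1).sum(axis=1)
            new_medoids.append(boundaries[i] + int(np.argmin(dists)))
        if new_medoids == medoids:
            break
        medoids = new_medoids

    # Expand segment labels into frame-wise pseudo labels for training.
    pseudo = []
    for i in range(K):
        pseudo.extend([labels[i]] * (boundaries[i + 1] - boundaries[i]))
    return pseudo
```

In this sketch, the resulting frame-wise pseudo labels would serve as targets for the frame-level losses in place of full annotations; the per-segment pairwise distance in the update step is quadratic in segment length, so a real implementation would likely subsample or cache distances.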
Related papers
- Activity Grammars for Temporal Action Segmentation [71.03141719666972]
Temporal action segmentation aims at translating an untrimmed activity video into a sequence of action segments.
This paper introduces an effective activity grammar to guide neural predictions for temporal action segmentation.
Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability.
arXiv Detail & Related papers (2023-12-07T12:45:33Z) - MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic
Video Segmentation [10.82074185158027]
We introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation.
The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding.
MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots.
arXiv Detail & Related papers (2023-08-22T04:23:59Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels to counter the lack of supervision for unannotated videos.
arXiv Detail & Related papers (2022-11-02T17:34:04Z) - A Generalized & Robust Framework For Timestamp Supervision in Temporal
Action Segmentation [79.436224998992]
In temporal action segmentation, timestamp supervision requires only a handful of labelled frames per video sequence.
We propose a novel Expectation-Maximization based approach that leverages the label uncertainty of unlabelled frames.
Our proposed method produces SOTA results and even exceeds the fully-supervised setup in several metrics and datasets.
arXiv Detail & Related papers (2022-07-20T18:30:48Z) - Efficient Long Sequence Encoding via Synchronization [29.075962393432857]
We propose a synchronization mechanism for hierarchical encoding.
Our approach first identifies anchor tokens across segments and groups them by their roles in the original input sequence.
Our approach is able to improve the global information exchange among segments while maintaining efficiency.
arXiv Detail & Related papers (2022-03-15T04:37:02Z) - Transformers in Action: Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference-time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z) - Learning to Align Sequential Actions in the Wild [123.62879270881807]
We propose an approach to align sequential actions in the wild that involve diverse temporal variations.
Our model accounts for both monotonic and non-monotonic sequences.
We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning.
arXiv Detail & Related papers (2021-11-17T18:55:36Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.