Action Duration Prediction for Segment-Level Alignment of Weakly-Labeled
Videos
- URL: http://arxiv.org/abs/2011.10190v1
- Date: Fri, 20 Nov 2020 03:16:53 GMT
- Title: Action Duration Prediction for Segment-Level Alignment of Weakly-Labeled
Videos
- Authors: Reza Ghoddoosian, Saif Sayed, Vassilis Athitsos
- Abstract summary: This paper focuses on weakly-supervised action alignment, where only the ordered sequence of video-level actions is available for training.
We propose a novel Duration Network, which captures a short temporal window of the video and learns to predict the remaining duration of a given action at any point in time with a level of granularity based on the type of that action.
- Score: 4.318555434063273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper focuses on weakly-supervised action alignment, where only the
ordered sequence of video-level actions is available for training. We propose a
novel Duration Network, which captures a short temporal window of the video and
learns to predict the remaining duration of a given action at any point in time
with a level of granularity based on the type of that action. Further, we
introduce a Segment-Level Beam Search to obtain the best alignment that
maximizes our posterior probability. Segment-Level Beam Search efficiently
aligns actions by considering only a selected set of frames that have more
confident predictions. The experimental results show that our alignments for
long videos are more robust than existing models. Moreover, the proposed method
achieves state-of-the-art results in certain cases on the popular Breakfast and
Hollywood Extended datasets.
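The combination of a duration model with a segment-level search is concrete enough to sketch. Below is a minimal, hypothetical Python illustration of beam search over whole segments rather than individual frames; it assumes per-frame class log-posteriors as input and substitutes a toy Poisson-style duration score for the paper's learned Duration Network, so the names and scoring details are illustrative, not the authors' code.
```python
# Hypothetical sketch of segment-level alignment via beam search; NOT the
# authors' implementation. `log_probs[t, a]` are per-frame class
# log-posteriors and `dur_logp` is a toy stand-in for the Duration Network.
import numpy as np
from math import lgamma, log

def dur_logp(action, length, mean_len=5.0):
    # Toy Poisson log-probability of a segment length; the paper instead
    # predicts the remaining duration from a short temporal window.
    return length * log(mean_len) - mean_len - lgamma(length + 1)

def segment_beam_search(log_probs, transcript, beam_size=8, max_seg=20):
    """Align an ordered action transcript to T frames, segment by segment.

    A hypothesis is (score, next_frame, boundaries): how much of the video
    has been consumed and where each completed segment ended.
    """
    T = log_probs.shape[0]
    beam = [(0.0, 0, [])]
    for i, action in enumerate(transcript):
        last = i == len(transcript) - 1
        remaining = len(transcript) - i - 1
        candidates = []
        for score, t, bounds in beam:
            # The last segment must consume all remaining frames; earlier
            # segments must leave at least one frame per remaining action.
            lengths = [T - t] if last else \
                range(1, min(max_seg, T - t - remaining) + 1)
            for L in lengths:
                obs = log_probs[t:t + L, action].sum()
                candidates.append((score + obs + dur_logp(action, L),
                                   t + L, bounds + [t + L]))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_size]
    return beam[0]  # highest-scoring full alignment

rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(3), size=30))  # 30 frames, 3 classes
score, _, ends = segment_beam_search(log_probs, transcript=[0, 2, 1])
print("log-score %.2f, segment ends %s" % (score, ends))
```
Restricting expansions to frames with confident predictions, as the paper's Segment-Level Beam Search does, would further prune the candidate segment boundaries.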
Related papers
- Generation-Guided Multi-Level Unified Network for Video Grounding [18.402093379973085]
Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.
Moment-level approaches directly predict the probability of each transient moment to be the boundary in a global perspective.
Clip-level ones aggregate the moments in different time windows into proposals and then select the most similar one, which gives them an advantage in fine-grained grounding.
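As a hypothetical illustration of the two paradigms contrasted above (not code from the paper), assume per-moment features and a query embedding; the windowing and similarity choices below are illustrative stand-ins.
```python
# Hypothetical contrast of moment-level vs. clip-level grounding; the
# feature shapes and dot-product scoring are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
T, D = 40, 16
feats = rng.standard_normal((T, D))   # per-moment video features
q = rng.standard_normal(D)            # query embedding

# Moment-level: score every timestep as a potential boundary directly.
boundary_logits = feats @ q                       # (T,) per-moment scores
start, end = sorted(np.argsort(boundary_logits)[-2:])

# Clip-level: aggregate moments into sliding-window proposals, then pick
# the proposal whose pooled feature is most similar to the query.
win = 8
proposals = [(s, s + win, feats[s:s + win].mean(axis=0))
             for s in range(T - win)]
best = max(proposals, key=lambda p: p[2] @ q)
print("moment-level:", (start, end), "clip-level:", best[:2])
```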
arXiv Detail & Related papers (2023-03-14T09:48:59Z)
- Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frames labels to counter the lack of supervision for unannotated videos.
arXiv Detail & Related papers (2022-11-02T17:34:04Z)
- The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction [104.628661890361]
Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video.
We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales.
arXiv Detail & Related papers (2022-04-28T08:21:09Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
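The cut-and-paste pretext task lends itself to a short sketch. The following is a minimal, hypothetical PyTorch version operating on precomputed clip features; the shapes, the stand-in linear `encoder`, and the cosine agreement loss are illustrative assumptions, not the paper's architecture.
```python
# Hypothetical sketch of the PAL pretext task on clip-feature sequences;
# the encoder and loss below are illustrative stand-ins.
import torch
import torch.nn.functional as F

def paste(region, target):
    """Paste a pseudo action region at a random temporal position."""
    t = torch.randint(0, target.shape[0] - region.shape[0] + 1, (1,)).item()
    out = target.clone()
    out[t:t + region.shape[0]] = region
    return out, t

torch.manual_seed(0)
encoder = torch.nn.Linear(32, 32)              # stand-in feature encoder
source = torch.randn(16, 32)                   # video supplying the region
vid_a, vid_b = torch.randn(20, 32), torch.randn(20, 32)

L = 4                                          # pseudo action length
s = torch.randint(0, source.shape[0] - L + 1, (1,)).item()
region = source[s:s + L]                       # one pseudo action region

# Paste the same region at different positions in two other videos, then
# align the features of the two pasted regions (maximize their agreement).
pa, ta = paste(region, vid_a)
pb, tb = paste(region, vid_b)
za = encoder(pa)[ta:ta + L].mean(dim=0)
zb = encoder(pb)[tb:tb + L].mean(dim=0)
loss = 1 - F.cosine_similarity(za, zb, dim=0)  # minimized in pre-training
print(float(loss))
```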
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- Anchor-Constrained Viterbi for Set-Supervised Action Segmentation [38.32743770719661]
This paper addresses action segmentation under weak supervision in training.
We use a Hidden Markov Model (HMM) grounded on a multilayer perceptron (MLP) to label video frames.
In testing, a Monte Carlo sampling of action sets seen in training is used to generate candidate temporal sequences of actions.
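A minimal, hypothetical sketch of that test-time procedure follows, assuming per-frame class log-posteriors: candidate action sequences are sampled and each is scored by an exact Viterbi alignment. The HMM/MLP details of the paper are omitted, and the toy inputs are illustrative.
```python
# Hypothetical sketch: score Monte Carlo-sampled candidate transcripts by
# exact Viterbi alignment; the monotonic-HMM structure is an assumption.
import numpy as np

def viterbi_align(log_probs, transcript):
    """Exact best monotonic alignment: frame t is labeled transcript[i],
    and i can only stay or advance by one between consecutive frames."""
    T, N = log_probs.shape[0], len(transcript)
    dp = np.full((T, N), -np.inf)
    dp[0, 0] = log_probs[0, transcript[0]]
    for t in range(1, T):
        for i in range(N):
            stay = dp[t - 1, i]
            advance = dp[t - 1, i - 1] if i > 0 else -np.inf
            dp[t, i] = max(stay, advance) + log_probs[t, transcript[i]]
    return dp[T - 1, N - 1]  # score of the best full alignment

rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(4), size=25))  # 25 frames, 4 classes
# Monte Carlo over an action set seen in training: sample candidate
# orderings and keep the best-scoring transcript.
candidates = [rng.permutation([0, 1, 3]).tolist() for _ in range(10)]
best = max(candidates, key=lambda c: viterbi_align(log_probs, c))
print(best)
```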
arXiv Detail & Related papers (2021-04-05T18:50:21Z)
- Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses [84.2964408497058]
Point-level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance.
Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels.
This paper attempts to explore the proposal-based prediction paradigm for point-level annotations.
arXiv Detail & Related papers (2020-12-15T12:11:48Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
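The multi-stage refinement idea is easy to sketch. Below is a minimal, hypothetical PyTorch version in the spirit of MS-TCN++, with residual dilated temporal convolutions; the layer counts and widths are illustrative, and the paper's dual-dilated layers are omitted.
```python
# Hypothetical multi-stage TCN sketch: each stage refines the previous
# stage's per-frame predictions. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Stage(nn.Module):
    def __init__(self, in_ch, hidden, n_classes, layers=4):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, hidden, 1)
        # Dilation doubles per layer, growing the temporal receptive field
        # while padding keeps the sequence length unchanged.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, hidden, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(layers))
        self.out = nn.Conv1d(hidden, n_classes, 1)

    def forward(self, x):
        h = self.inp(x)
        for conv in self.convs:
            h = h + torch.relu(conv(h))        # residual dilated layer
        return self.out(h)                     # per-frame class logits

class MultiStageTCN(nn.Module):
    def __init__(self, feat_dim, n_classes, n_stages=3, hidden=64):
        super().__init__()
        self.first = Stage(feat_dim, hidden, n_classes)
        self.rest = nn.ModuleList(
            Stage(n_classes, hidden, n_classes) for _ in range(n_stages - 1))

    def forward(self, feats):                  # feats: (B, feat_dim, T)
        outs = [self.first(feats)]
        for stage in self.rest:                # later stages refine the
            outs.append(stage(outs[-1].softmax(dim=1)))  # softmaxed output
        return outs                            # one prediction per stage

model = MultiStageTCN(feat_dim=128, n_classes=10)
logits = model(torch.randn(2, 128, 100))
print([o.shape for o in logits])               # three (2, 10, 100) outputs
```
In training, a loss would typically be applied to every stage's output so that each stage learns to refine its predecessor.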
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
- Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor to the automatic recognition of human actions in videos.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)
- Fast Template Matching and Update for Video Object Tracking and Segmentation [56.465510428878]
The main task we aim to tackle is the multi-instance semi-supervised video object segmentation across a sequence of frames.
The challenges lie in the selection of the matching method to predict the result as well as to decide whether to update the target template.
We propose a novel approach which utilizes reinforcement learning to make these two decisions at the same time.
arXiv Detail & Related papers (2020-04-16T08:58:45Z)
- SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets where the approach achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)
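Set supervision itself can be made concrete with a short, hypothetical sketch (a generic set-level loss, not the SCT model): only the set of actions occurring in a video is known, so per-class evidence pooled over time is pushed toward set membership.
```python
# Hypothetical set-supervised loss; the max-pooling and BCE choices are
# generic illustrations of set-level weak supervision, not SCT itself.
import torch
import torch.nn.functional as F

def set_supervision_loss(frame_logits, action_set, n_classes):
    """frame_logits: (T, C) per-frame class scores; action_set: class ids."""
    video_logits = frame_logits.max(dim=0).values  # (C,) does class occur?
    target = torch.zeros(n_classes)
    target[list(action_set)] = 1.0                 # 1 iff class is in the set
    return F.binary_cross_entropy_with_logits(video_logits, target)

loss = set_supervision_loss(torch.randn(50, 8), {1, 4, 6}, n_classes=8)
print(float(loss))
```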