SF-Net: Single-Frame Supervision for Temporal Action Localization
- URL: http://arxiv.org/abs/2003.06845v6
- Date: Sat, 15 Aug 2020 04:20:57 GMT
- Title: SF-Net: Single-Frame Supervision for Temporal Action Localization
- Authors: Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt
Feiszli, Zheng Shou
- Abstract summary: Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
- Score: 60.202516362976645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study an intermediate form of supervision, i.e.,
single-frame supervision, for temporal action localization (TAL). To obtain the
single-frame supervision, the annotators are asked to identify only a single
frame within the temporal window of an action. This can significantly reduce
the labor cost of obtaining full supervision which requires annotating the
action boundary. Compared to the weak supervision that only annotates the
video-level label, the single-frame supervision introduces extra temporal
action signals while maintaining low annotation overhead. To make full use of
such single-frame supervision, we propose a unified system called SF-Net.
First, we propose to predict an actionness score for each video frame. Along
with a typical category score, the actionness score can provide comprehensive
information about the occurrence of a potential action and aid the temporal
boundary refinement during inference. Second, we mine pseudo action and
background frames based on the single-frame annotations. We identify pseudo
action frames by adaptively expanding each annotated single frame to its
nearby, contextual frames and we mine pseudo background frames from all the
unannotated frames across multiple videos. Together with the ground-truth
labeled frames, these pseudo-labeled frames are further used for training the
classifier. In extensive experiments on THUMOS14, GTEA, and BEOID, SF-Net
significantly improves upon state-of-the-art weakly-supervised methods in terms
of both segment localization and single-frame localization. Notably, SF-Net
achieves results comparable to its fully-supervised counterpart, which requires
far more resource-intensive annotations. The code is available at
https://github.com/Flowerfan/SF-Net.
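As a rough illustration of the two ideas above, here is a minimal sketch assuming generic frame features; the names `FrameHead` and `expand_annotation` and the similarity threshold are hypothetical, and the paper's actual expansion rule and training losses may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameHead(nn.Module):
    """Hypothetical per-frame head: category logits plus a
    class-agnostic actionness score, as described in the abstract."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)  # per-frame category scores
        self.act = nn.Linear(feat_dim, 1)            # per-frame actionness score

    def forward(self, feats):                        # feats: (T, D)
        return self.cls(feats), torch.sigmoid(self.act(feats)).squeeze(-1)

def expand_annotation(feats, t, threshold=0.9):
    """Grow one annotated frame t into pseudo action frames by walking
    outwards while neighbours remain similar to the anchor frame.
    The similarity rule and threshold are assumptions, not the paper's."""
    anchor = F.normalize(feats[t], dim=0)
    left, right = t, t
    while left > 0 and F.normalize(feats[left - 1], dim=0) @ anchor > threshold:
        left -= 1
    while right < len(feats) - 1 and F.normalize(feats[right + 1], dim=0) @ anchor > threshold:
        right += 1
    return list(range(left, right + 1))              # pseudo action frame indices

feats = torch.randn(100, 1024)                       # toy frame features (T, D)
head = FrameHead(1024, num_classes=20)
category_scores, actionness = head(feats)
pseudo_action_frames = expand_annotation(feats, t=42)
```

At inference, frames scoring high on both the category and actionness heads can be grouped into candidate segments, with actionness guiding boundary refinement; unannotated frames scoring low across many videos would serve as pseudo background frames.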
Related papers
- Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding [64.99924160432144]
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query.
We propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames.
arXiv Detail & Related papers (2023-01-02T03:38:22Z)
- A Generalized & Robust Framework For Timestamp Supervision in Temporal Action Segmentation [79.436224998992]
In temporal action segmentation, timestamp supervision requires only a handful of labelled frames per video sequence.
We propose a novel Expectation-Maximization based approach that leverages the label uncertainty of unlabelled frames.
Our proposed method produces SOTA results and even exceeds the fully-supervised setup on several metrics and datasets (a schematic sketch of the EM idea follows this entry).
arXiv Detail & Related papers (2022-07-20T18:30:48Z)
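A minimal sketch of the EM idea in the entry above, under assumed simplifications (frame-wise independence, a generic frame classifier); the function name is hypothetical and this is not the paper's implementation.

```python
import numpy as np

def e_step_pseudo_labels(frame_probs, timestamps, n_classes):
    """Schematic E-step: keep the model's soft posteriors for unlabelled
    frames (their label uncertainty) and clamp annotated timestamps to
    their known labels. frame_probs: (T, C); timestamps: {frame: class}."""
    soft = frame_probs.copy()
    for t, c in timestamps.items():
        soft[t] = np.eye(n_classes)[c]               # annotated frames are certain
    return soft

# The M-step would retrain the frame classifier on these soft targets,
# and E/M alternate until the pseudo-labels stabilise.
T, C = 50, 5
probs = np.random.dirichlet(np.ones(C), size=T)      # stand-in model posteriors
targets = e_step_pseudo_labels(probs, {10: 2, 40: 4}, C)
```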
- Context Sensing Attention Network for Video-based Person Re-identification [20.865710012336724]
Video-based person re-identification (ReID) is challenging due to the presence of various interferences in video frames.
Recent approaches handle this problem using temporal aggregation strategies.
We propose a novel Context Sensing Attention Network (CSA-Net), which improves both the frame feature extraction and temporal aggregation steps.
arXiv Detail & Related papers (2022-07-06T12:48:27Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
Its FGSW-MSA module enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
FGST outperforms state-of-the-art methods on both the DVD and GOPRO datasets and even yields more visually pleasing results on real video deblurring (a simplified sketch of the flow-guided sampling follows this entry).
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
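A simplified sketch of the flow-guided sampling in the entry above, reduced to dense warping of a neighbouring frame's features with estimated optical flow; the actual FGSW-MSA samples sparse elements inside an attention module, so treat `flow_warp` as an assumption-laden stand-in.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Sample a neighbouring frame's features at flow-displaced positions so
    that elements of the same scene patch line up with the reference frame.
    feat: (N, C, H, W); flow: (N, 2, H, W) in pixels (dx, dy)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(feat.device)  # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)           # displace by flow
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1                 # normalise x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1                 # normalise y to [-1, 1]
    return F.grid_sample(feat, grid, align_corners=True)

neighbor = torch.randn(1, 64, 32, 32)    # features of a neighbouring frame
flow = torch.randn(1, 2, 32, 32)         # e.g. from a pretrained flow estimator
aligned = flow_warp(neighbor, flow)      # attention keys/values would come from here
```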
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning instance-level action patterns from video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of existing approaches mainly comes from background errors, we find that a stronger action localizer can be trained with labels on background video frames rather than on action frames (a toy loss sketch follows this entry).
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
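A toy sketch of the background-click idea above: clicked background frames receive an explicit frame-level loss alongside the usual video-level supervision; the background-as-extra-class design and `background_click_loss` are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def background_click_loss(frame_logits, bg_clicks, num_classes):
    """Frame-level supervision on clicked background frames.
    frame_logits: (T, C + 1) with index C reserved for background;
    bg_clicks: indices of annotator-clicked background frames."""
    target = torch.full((len(bg_clicks),), num_classes, dtype=torch.long)
    return F.cross_entropy(frame_logits[bg_clicks], target)

logits = torch.randn(200, 21)            # 20 action classes + background
clicks = torch.tensor([3, 57, 111])      # clicked background frames
loss = background_click_loss(logits, clicks, num_classes=20)
# In training this would be added to the usual video-level classification loss.
```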
- No frame left behind: Full Video Action Recognition [26.37329995193377]
We propose full video action recognition, which considers all video frames.
We first cluster all frame activations along the temporal dimension.
We then temporally aggregate the frames in each cluster into a smaller number of representations (see the sketch after this entry).
arXiv Detail & Related papers (2021-03-29T07:44:28Z)
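A minimal sketch of the cluster-then-aggregate step in the entry above; the choice of k-means and mean pooling is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_frames(frame_feats, n_clusters=8):
    """Cluster per-frame activations, then pool each cluster into a single
    representation; clusters are ordered by mean temporal position so the
    aggregated sequence stays roughly chronological.
    frame_feats: (T, D) -> (n_clusters, D)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(frame_feats)
    order = np.argsort([np.where(labels == c)[0].mean() for c in range(n_clusters)])
    return np.stack([frame_feats[labels == c].mean(axis=0) for c in order])

feats = np.random.randn(300, 512)        # activations for all 300 frames
pooled = aggregate_frames(feats)         # compact set fed to the classifier
```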
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that groups semantically consistent frames of the video (a schematic sketch follows this entry).
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
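A schematic sketch of temporally-weighted hierarchical clustering as described above; the additive time term and `lam` weighting are assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

def temporally_weighted_segments(frame_feats, n_segments=5, lam=0.5):
    """Agglomerative clustering on a distance that mixes feature
    dissimilarity with normalised temporal distance, so frames far apart
    in time are harder to merge. frame_feats: (T, D)."""
    T = len(frame_feats)
    d_feat = squareform(pdist(frame_feats, metric="euclidean"))
    idx = np.arange(T)
    d_time = np.abs(idx[:, None] - idx[None, :]) / T
    combined = squareform(d_feat + lam * d_time, checks=False)   # back to condensed
    Z = linkage(combined, method="average")
    return fcluster(Z, t=n_segments, criterion="maxclust")       # segment id per frame

feats = np.random.randn(120, 64)         # toy per-frame features
segments = temporally_weighted_segments(feats)
```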
- Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric Activities datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.