Weakly-Supervised Online Action Segmentation in Multi-View Instructional
Videos
- URL: http://arxiv.org/abs/2203.13309v1
- Date: Thu, 24 Mar 2022 19:27:56 GMT
- Title: Weakly-Supervised Online Action Segmentation in Multi-View Instructional
Videos
- Authors: Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, Chiho Choi, Behzad
Dariush
- Abstract summary: We present a framework to segment streaming videos online at test time using Dynamic Programming.
We improve our framework by introducing the Online-Offline Discrepancy Loss (OODL) to encourage higher temporal consistency in the segmentation results.
- Score: 20.619236432228625
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper addresses a new problem of weakly-supervised online action
segmentation in instructional videos. We present a framework to segment
streaming videos online at test time using Dynamic Programming and show its
advantages over a greedy sliding window approach. We improve our framework
by introducing the Online-Offline Discrepancy Loss (OODL) to encourage higher
temporal consistency in the segmentation results. Furthermore, only during
training, we exploit frame-wise correspondence between multiple views as
supervision for training on weakly-labeled instructional videos. In particular, we
investigate three different multi-view inference techniques to generate more
accurate frame-wise pseudo ground-truth with no additional annotation cost. We
present results and ablation studies on two benchmark multi-view datasets,
Breakfast and IKEA ASM. Experimental results show the efficacy of the proposed
methods both qualitatively and quantitatively in two domains of cooking and
assembly.
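To make the decoding step concrete, below is a minimal sketch, in Python/NumPy rather than the authors' code, of online action segmentation by dynamic programming: frames arrive one at a time and are aligned to a prefix of the weak-label transcript, so a label is committed per frame instead of by a greedy sliding window. The transcript, the per-frame log-probabilities, and all names are illustrative assumptions.

```python
import numpy as np

def online_dp_segment(frame_logprobs, transcript):
    """Emit one label per streamed frame by Viterbi-style alignment.

    frame_logprobs: (T, C) per-frame log-probabilities over C action classes.
    transcript: ordered list of K action ids given as weak supervision.
    """
    T, K = frame_logprobs.shape[0], len(transcript)
    NEG = -1e18  # acts as -infinity for unreachable states
    online_labels = []
    for t in range(T):
        emit = frame_logprobs[t, transcript]  # frame-t score of each transcript step
        if t == 0:
            dp = np.full(K, NEG)
            dp[0] = emit[0]  # the alignment must start at the first step
        else:
            advance = np.concatenate(([NEG], dp[:-1]))  # move on to the next step
            dp = np.maximum(dp, advance) + emit         # best of "stay" vs. "advance"
        # online decision: the step of the best transcript prefix ending at frame t
        online_labels.append(transcript[int(np.argmax(dp))])
    return online_labels

# toy usage: 3 classes, weak transcript 0 -> 2 -> 1, nine streamed frames
rng = np.random.default_rng(0)
scores = rng.normal(size=(9, 3))
logprobs = scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)
print(online_dp_segment(logprobs, [0, 2, 1]))
```

During training, an OODL-style term would additionally compare these per-frame online decisions against a full-sequence offline decode and penalize disagreements, which is what drives the higher temporal consistency described above; this sketch covers only the inference side.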
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis [31.541911711448318]
We introduce a weakly supervised framework for procedure-aware correlation learning on instructional videos.
Our framework comprises two core modules: collaborative step mining and frame-to-step alignment.
We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment.
arXiv Detail & Related papers (2023-12-18T08:57:10Z)
- Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses an abnormal event.
We propose a dual-branch network that takes as input proposals of multiple granularities in both the spatial and temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drifts caused by spatial-temporal discontinuity; and (iii) we demonstrate state-of-the-art results among self-supervised approaches on DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2020-06-22T17:55:59Z)
- Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.
Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.
We present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage (see the illustrative sketch below).
arXiv Detail & Related papers (2020-03-16T07:01:01Z)
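Several of the grounding papers above rely on the same weakly-supervised trick: score candidate segments by how well attending to them lets the model reconstruct the query. The NumPy sketch below illustrates that generic attention-plus-reconstruction idea; it is not MARN's actual architecture, and every shape, name, and the scoring rule are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def ground_by_reconstruction(segment_feats, query_feat):
    """Score segments by how well their attended mix reconstructs the query.

    segment_feats: (N, D) features of N candidate video segments.
    query_feat: (D,) sentence embedding of the textual query.
    Returns attention over segments (argmax = grounded segment) and the
    reconstruction error a training loop would minimize.
    """
    # scaled dot-product attention of the query over the segments
    att = softmax(segment_feats @ query_feat / np.sqrt(len(query_feat)))
    recon = att @ segment_feats                       # query rebuilt from video
    recon_error = float(np.sum((recon - query_feat) ** 2))
    return att, recon_error

# toy usage: 5 segments with 8-d features; segment 2 matches the query
rng = np.random.default_rng(0)
segs = rng.normal(size=(5, 8))
query = segs[2] + 0.1 * rng.normal(size=8)
att, err = ground_by_reconstruction(segs, query)
print(att.round(2), int(att.argmax()), round(err, 3))
```

Because the reconstruction error is computed from video-sentence pairs alone, no per-segment temporal annotation enters the loss, which is exactly the weak-supervision setting these papers target.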
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.