Collaborative Weakly Supervised Video Correlation Learning for
Procedure-Aware Instructional Video Analysis
- URL: http://arxiv.org/abs/2312.11024v1
- Date: Mon, 18 Dec 2023 08:57:10 GMT
- Title: Collaborative Weakly Supervised Video Correlation Learning for
Procedure-Aware Instructional Video Analysis
- Authors: Tianyao He, Huabin Liu, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang,
Weiyao Lin
- Abstract summary: We introduce a weakly supervised framework for procedure-aware correlation learning on instructional videos.
Our framework comprises two core modules: collaborative step mining and frame-to-step alignment.
We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment.
- Score: 31.541911711448318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Correlation Learning (VCL), which aims to analyze the relationships
between videos, has been widely studied and applied in various general video
tasks. However, applying VCL to instructional videos is still quite challenging
due to their intrinsic procedural temporal structure. Specifically, procedural
knowledge is critical for accurate correlation analyses on instructional
videos. Nevertheless, current procedure-learning methods heavily rely on
step-level annotations, which are costly and not scalable. To address this
problem, we introduce a weakly supervised framework called Collaborative
Procedure Alignment (CPA) for procedure-aware correlation learning on
instructional videos. Our framework comprises two core modules: collaborative
step mining and frame-to-step alignment. The collaborative step mining module
enables simultaneous and consistent step segmentation for paired videos,
leveraging the semantic and temporal similarity between frames. Based on the
identified steps, the frame-to-step alignment module performs alignment between
the frames and steps across videos. The alignment result serves as a
measurement of the correlation distance between two videos. We instantiate our
framework in two distinct instructional video tasks: sequence verification and
action quality assessment. Extensive experiments validate the effectiveness of
our approach in providing accurate and interpretable correlation analyses for
instructional videos.
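The frame-to-step alignment idea can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): frames of one video are monotonically aligned to ordered step representations mined from its paired video, and the normalized alignment cost serves as a correlation distance. All function names below are illustrative assumptions.

```python
from math import sqrt

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def frame_to_step_distance(frames, steps):
    """Monotonically align T frame features to K ordered step prototypes
    with a DTW-style dynamic program; the normalized path cost acts as a
    correlation distance between the two videos."""
    T, K = len(frames), len(steps)
    cost = [[cosine_distance(f, s) for s in steps] for f in frames]
    INF = float("inf")
    acc = [[INF] * K for _ in range(T)]
    acc[0][0] = cost[0][0]
    for t in range(1, T):
        acc[t][0] = acc[t - 1][0] + cost[t][0]  # stay on the first step
    for t in range(1, T):
        for k in range(1, K):
            # Each frame either stays on the current step or advances by one.
            acc[t][k] = cost[t][k] + min(acc[t - 1][k], acc[t - 1][k - 1])
    return acc[-1][-1] / T
```

Two videos whose frames follow the same step order yield a near-zero distance, while a swapped step order inflates the path cost.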
Related papers
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame using three machine vision tools: person detection, pose estimation, and VGG network.
The resulting time series are used to align videos of the same actions using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW).
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Weakly-supervised Representation Learning for Video Alignment and Analysis [16.80278496414627]
This paper introduces LRProp -- a novel weakly-supervised representation learning approach.
The proposed algorithm also uses a regularized SoftDTW loss to better tune the learned features.
Our novel representation learning paradigm consistently outperforms the state of the art on temporal alignment tasks.
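SoftDTW replaces the hard minimum in classic dynamic time warping with a smoothed soft-minimum, which makes the alignment cost differentiable and therefore usable as a training loss. Below is a minimal sketch of the plain soft-DTW recursion over a precomputed pairwise cost matrix; it is an illustration of the general technique, not LRProp's code.

```python
from math import exp, log

def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    m = min(values)  # subtract the minimum for numerical stability
    return m - gamma * log(sum(exp(-(v - m) / gamma) for v in values))

def soft_dtw(cost, gamma=0.1):
    """Soft-DTW value of a T x K pairwise cost matrix.
    As gamma -> 0 this recovers the classic hard-DTW distance."""
    T, K = len(cost), len(cost[0])
    INF = float("inf")
    # acc[t][k] holds the soft alignment cost up to cell (t, k).
    acc = [[INF] * (K + 1) for _ in range(T + 1)]
    acc[0][0] = 0.0
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            acc[t][k] = cost[t - 1][k - 1] + softmin(
                [acc[t - 1][k], acc[t][k - 1], acc[t - 1][k - 1]], gamma)
    return acc[T][K]
```

Because the soft-minimum never exceeds the hard minimum, larger `gamma` values yield smaller (more smoothed) alignment costs.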
arXiv Detail & Related papers (2023-02-08T14:01:01Z)
- Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos [20.619236432228625]
We present a framework to segment streaming videos online at test time using Dynamic Programming.
We improve our framework by introducing the Online-Offline Discrepancy Loss (OODL) to encourage the segmentation results to have a higher temporal consistency.
arXiv Detail & Related papers (2022-03-24T19:27:56Z)
- vCLIMB: A Novel Video Class Incremental Learning Benchmark [53.90485760679411]
We introduce vCLIMB, a novel video continual learning benchmark.
vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning.
We propose a temporal consistency regularization that can be applied on top of memory-based continual learning methods.
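A generic form of such a temporal consistency regularizer penalizes drift between embeddings of consecutive frames; the sketch below is an illustrative assumption, not vCLIMB's exact formulation.

```python
def temporal_consistency_loss(embeddings):
    """Mean squared difference between consecutive frame embeddings,
    encouraging temporally smooth representations (a generic regularizer
    added on top of the base continual-learning objective)."""
    total, count = 0.0, 0
    for prev, curr in zip(embeddings, embeddings[1:]):
        total += sum((a - b) ** 2 for a, b in zip(prev, curr))
        count += 1
    return total / max(count, 1)
```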
arXiv Detail & Related papers (2022-01-23T22:14:17Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show its retrieval accuracy is comparable with that of baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
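The underlying contrastive objective is typically InfoNCE, which scores a positive pair against a set of negatives; below is a minimal single-anchor sketch of that standard objective (illustrative only, not CoCon's multi-view formulation).

```python
from math import exp, log

def info_nce(sim_pos, sims_neg, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive
    similarity against the negative similarities."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)  # stabilize the log-sum-exp
    lse = m + log(sum(exp(l - m) for l in logits))
    return -(sim_pos / temperature - lse)
```

The loss is small when the positive pair is much more similar than the negatives, and grows as negatives become competitive.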
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.