Collaborative Weakly Supervised Video Correlation Learning for
Procedure-Aware Instructional Video Analysis
- URL: http://arxiv.org/abs/2312.11024v1
- Date: Mon, 18 Dec 2023 08:57:10 GMT
- Title: Collaborative Weakly Supervised Video Correlation Learning for
Procedure-Aware Instructional Video Analysis
- Authors: Tianyao He, Huabin Liu, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang,
Weiyao Lin
- Abstract summary: We introduce a weakly supervised framework for procedure-aware correlation learning on instructional videos.
Our framework comprises two core modules: collaborative step mining and frame-to-step alignment.
We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment.
- Score: 31.541911711448318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Correlation Learning (VCL), which aims to analyze the relationships
between videos, has been widely studied and applied in various general video
tasks. However, applying VCL to instructional videos is still quite challenging
due to their intrinsic procedural temporal structure. Specifically, procedural
knowledge is critical for accurate correlation analyses on instructional
videos. Nevertheless, current procedure-learning methods heavily rely on
step-level annotations, which are costly and not scalable. To address this
problem, we introduce a weakly supervised framework called Collaborative
Procedure Alignment (CPA) for procedure-aware correlation learning on
instructional videos. Our framework comprises two core modules: collaborative
step mining and frame-to-step alignment. The collaborative step mining module
enables simultaneous and consistent step segmentation for paired videos,
leveraging the semantic and temporal similarity between frames. Based on the
identified steps, the frame-to-step alignment module performs alignment between
the frames and steps across videos. The alignment result serves as a
measurement of the correlation distance between two videos. We instantiate our
framework in two distinct instructional video tasks: sequence verification and
action quality assessment. Extensive experiments validate the effectiveness of
our approach in providing accurate and interpretable correlation analyses for
instructional videos.
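
The abstract gives no implementation details, but the core idea of the frame-to-step alignment module — soft-assigning the frames of one video to the steps mined from its paired video and reading off the alignment cost as a correlation distance — can be illustrated with a minimal numpy sketch. Everything below (function names, cosine similarity, the softmax temperature) is an assumption for illustration, not the authors' CPA implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def frame_to_step_distance(frames_a, steps_b, temperature=0.1):
    """Align frames of video A to step embeddings mined from video B.

    frames_a: (T, D) frame embeddings of video A
    steps_b:  (K, D) step-level embeddings of video B
    Returns a scalar correlation distance: the soft-assigned cosine
    distance between each frame and the other video's steps.
    """
    fa = l2_normalize(frames_a)
    sb = l2_normalize(steps_b)
    sim = fa @ sb.T                        # (T, K) cosine similarities
    # Soft assignment of each frame to the K steps
    logits = sim / temperature
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Expected cosine distance under the soft assignment, averaged over frames
    return float(np.mean(np.sum(p * (1.0 - sim), axis=1)))

# Toy usage: 8 frames vs. 3 mined steps in a 16-d embedding space
rng = np.random.default_rng(0)
print(frame_to_step_distance(rng.normal(size=(8, 16)), rng.normal(size=(3, 16))))
```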
Related papers
- Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment [53.12952107996463]
This work proposes a novel training framework for learning to localize temporal boundaries of procedure steps in training videos.
Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps from narrations.
To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy.
arXiv Detail & Related papers (2024-09-22T18:40:55Z)
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
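
Because the method above is training-free, alignment reduces to classic dynamic programming over precomputed per-frame features. In the sketch below, generic feature vectors stand in for the paper's person-detection, pose, and VGG features; plain DTW and all names are illustrative assumptions.

```python
import numpy as np

def dtw_align(feats_a, feats_b):
    """Classic DTW on precomputed per-frame features (no training needed).

    feats_a: (Ta, D), feats_b: (Tb, D). Returns (total cost, warping path).
    """
    ta, tb = len(feats_a), len(feats_b)
    cost = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Backtrack the optimal warping path
    path, i, j = [], ta, tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[ta, tb], path[::-1]
```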
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Weakly-supervised Representation Learning for Video Alignment and Analysis [16.80278496414627]
This paper introduces LRProp -- a novel weakly-supervised representation learning approach.
The proposed algorithm also uses a regularized SoftDTW loss to better tune the learned features.
Our novel representation learning paradigm consistently outperforms the state of the art on temporal alignment tasks.
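
The regularized SoftDTW loss mentioned here builds on soft-DTW (Cuturi & Blondel, 2017), which replaces DTW's hard minimum with a soft minimum so the alignment cost becomes differentiable. Below is a bare numpy sketch of that recursion; the squared-Euclidean cost and the names are assumptions, and the paper's extra regularization is omitted.

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two feature sequences.

    x: (Tx, D), y: (Ty, D); gamma > 0 controls how soft the min is
    (gamma -> 0 recovers classic DTW).
    """
    def softmin(a, b, c):
        # Numerically stable soft minimum: -gamma * logsumexp(-args / gamma)
        z = -np.array([a, b, c]) / gamma
        zmax = z.max()
        return -gamma * (zmax + np.log(np.exp(z - zmax).sum()))

    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared Euclidean
    tx, ty = d.shape
    r = np.full((tx + 1, ty + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, tx + 1):
        for j in range(1, ty + 1):
            r[i, j] = d[i - 1, j - 1] + softmin(r[i - 1, j],
                                                r[i, j - 1],
                                                r[i - 1, j - 1])
    return r[tx, ty]
```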
arXiv Detail & Related papers (2023-02-08T14:01:01Z)
- Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos [20.619236432228625]
We present a framework to segment streaming videos online at test time using Dynamic Programming.
We further introduce the Online-Offline Discrepancy Loss (OODL) to encourage temporally consistent segmentation results.
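
Online DP segmentation of this kind can be pictured as a Viterbi-style update: as each frame arrives, the best score of being in step k is extended either by staying in k or by advancing from k-1. The toy sketch below assumes a known ordered step transcript and per-frame log-likelihoods; it is not the paper's exact formulation, and OODL is omitted.

```python
import numpy as np

def online_segment_step(dp, frame_loglik):
    """One streaming DP update: extend the segmentation by a single frame.

    dp: (K,) best log-score of currently being in step k
    frame_loglik: (K,) per-step log-likelihood of the incoming frame
    Steps must be visited in transcript order, so step k is reached
    either by staying in k or by advancing from k-1.
    """
    new_dp = np.full_like(dp, -np.inf)
    for k in range(len(dp)):
        stay = dp[k]
        advance = dp[k - 1] if k > 0 else -np.inf
        new_dp[k] = max(stay, advance) + frame_loglik[k]
    return new_dp

# Toy stream: 3 ordered steps, frames arriving one at a time
rng = np.random.default_rng(1)
dp = np.array([0.0, -np.inf, -np.inf])   # the stream must start in step 0
for _ in range(10):
    dp = online_segment_step(dp, np.log(rng.dirichlet(np.ones(3))))
print("current step estimate:", int(np.argmax(dp)))
```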
arXiv Detail & Related papers (2022-03-24T19:27:56Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show its retrieval accuracy is comparable to baselines that adopt cross-modal interaction learning.
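
Encoding text and video separately means retrieval scoring reduces to a dot product between precomputed embeddings, and the contrastive objective can be as simple as a symmetric InfoNCE over in-batch negatives. The sketch below is that generic recipe, not ReLoCLNet's actual losses.

```python
import numpy as np

def info_nce(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (text, video) pairs.

    text_emb, video_emb: (B, D); row i of each is a positive pair,
    and all other rows in the batch act as negatives.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def xent(lg):
        # Cross-entropy with the matching index as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text-to-video and video-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```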
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
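
A common way to incorporate long-range temporal information between frame-level features is self-attention over the frame sequence before pooling to a single video descriptor. The numpy sketch below shows that generic pattern; the single head, mean pooling, and normalization are assumptions rather than TCA's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(frames, wq, wk, wv):
    """Single-head self-attention over frame features, then mean-pool.

    frames: (T, D) frame-level features; wq/wk/wv: (D, D) projections.
    Every frame attends to every other frame, injecting long-range
    temporal context before the video-level pooling step.
    """
    q, k, v = frames @ wq, frames @ wk, frames @ wv
    attn = softmax(q @ k.T / np.sqrt(frames.shape[1]))   # (T, T) weights
    contextual = attn @ v                                # (T, D)
    video_descriptor = contextual.mean(axis=0)           # (D,)
    return video_descriptor / np.linalg.norm(video_descriptor)
```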
arXiv Detail & Related papers (2020-08-04T05:24:20Z)