Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
- URL: http://arxiv.org/abs/2303.13800v4
- Date: Thu, 21 Mar 2024 02:31:39 GMT
- Title: Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
- Authors: Jiahao Zhang, Anoop Cherian, Yanbin Liu, Yizhak Ben-Shabat, Cristian Rodriguez, Stephen Gould
- Abstract summary: We consider a novel setting where alignment is between (i) instruction steps that are depicted as assembly diagrams and (ii) video segments from in-the-wild videos.
We introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams.
Experiments on IAW (Ikea Assembly in the Wild) demonstrate the superior performance of our approach against alternatives.
- Score: 51.67930509196712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos show the assembly actions being enacted in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset: IAW (Ikea Assembly in the Wild), consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals, annotated with ground-truth alignments. We define two tasks on this dataset: first, nearest neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps to the segments of each video. Extensive experiments on IAW demonstrate the superior performance of our approach against alternatives.
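The abstract does not spell out the paper's losses, but the general shape of a supervised contrastive alignment between paired video-segment and diagram embeddings can be sketched as a symmetric InfoNCE-style objective, where matched pairs form the diagonal of a similarity matrix and all other pairs act as negatives. The function and embedding shapes below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def contrastive_alignment_loss(video_emb, diagram_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched (video, diagram) pairs lie on
    the diagonal of the similarity matrix; all other pairs are negatives."""
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    d = diagram_emb / np.linalg.norm(diagram_emb, axis=1, keepdims=True)
    logits = (v @ d.T) / temperature  # (N, N) similarity matrix

    def cross_entropy_on_diagonal(mat):
        mat = mat - mat.max(axis=1, keepdims=True)  # numerical stability
        log_prob = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the video->diagram and diagram->video directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
# Nearly identical embeddings should yield a much lower loss than random ones.
loss_matched = contrastive_alignment_loss(
    video, video + 0.01 * rng.normal(size=(4, 16)))
loss_random = contrastive_alignment_loss(video, rng.normal(size=(4, 16)))
```

The same similarity matrix also supports the dataset's first task, nearest neighbor retrieval, by taking the argmax over each row or column.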
Related papers
- Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis [31.541911711448318]
We introduce a weakly supervised framework for procedure-aware correlation learning on instructional videos.
Our framework comprises two core modules: collaborative step mining and frame-to-step alignment.
We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment.
arXiv Detail & Related papers (2023-12-18T08:57:10Z) - Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame using three machine vision tools: person detection, pose estimation, and VGG network.
The resulting time series are used to align videos of the same actions using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW).
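DDTW is this paper's variant; the underlying mechanism it builds on is classic dynamic time warping, which finds a minimum-cost monotonic alignment between two frame-feature sequences via dynamic programming. A plain DTW sketch (not the diagonalized variant) over a precomputed pairwise frame-cost matrix:

```python
import numpy as np

def dtw(cost):
    """Classic dynamic time warping over a pairwise frame-cost matrix.
    Returns the minimum accumulated cost of a monotonic alignment path
    from (0, 0) to (n-1, m-1)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in sequence A only
                acc[i, j - 1],      # advance in sequence B only
                acc[i - 1, j - 1],  # advance in both (match)
            )
    return acc[n, m]

# Two 1-D feature sequences of different lengths; b repeats a frame,
# which DTW absorbs at zero cost since the features still match.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0]])
cost = np.abs(a[:, None, 0] - b[None, :, 0])
total = dtw(cost)  # 0.0: every frame of a matches a frame of b exactly
```

The diagonalization in DDTW presumably constrains or reweights this path search, but the details are not given in the summary above.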
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Multimodal Pretraining for Dense Video Captioning [26.39052753539932]
We construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT)
We explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts.
We show that such models generalize well and are robust over a wide variety of instructional videos.
arXiv Detail & Related papers (2020-11-10T21:49:14Z) - Motion-supervised Co-Part Segmentation [88.40393225577088]
We propose a self-supervised deep learning method for co-part segmentation.
Our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts.
arXiv Detail & Related papers (2020-04-07T09:56:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.