Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
- URL: http://arxiv.org/abs/2303.13800v4
- Date: Thu, 21 Mar 2024 02:31:39 GMT
- Title: Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
- Authors: Jiahao Zhang, Anoop Cherian, Yanbin Liu, Yizhak Ben-Shabat, Cristian Rodriguez, Stephen Gould
- Abstract summary: We consider a novel setting where alignment is between (i) instruction steps that are depicted as assembly diagrams and (ii) video segments from in-the-wild videos.
We introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams.
Experiments on IAW (Ikea Assembly in the Wild) demonstrate the superior performance of our approach against alternatives.
- Score: 51.67930509196712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos show the assembly actions being enacted in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset: IAW (Ikea Assembly in the Wild), consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals, annotated with ground-truth alignments. We define two tasks on this dataset: first, nearest neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps to the segments of each video. Extensive experiments on IAW demonstrate the superior performance of our approach against alternatives.
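The abstract does not spell out the paper's losses, but the general shape of a supervised contrastive alignment between paired video-segment and diagram embeddings can be sketched as a symmetric InfoNCE-style objective, where matched pairs form the diagonal of a similarity matrix and all other pairs act as negatives. The function and embedding shapes below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def contrastive_alignment_loss(video_emb, diagram_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched (video, diagram) pairs lie on
    the diagonal of the similarity matrix; all other pairs are negatives."""
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    d = diagram_emb / np.linalg.norm(diagram_emb, axis=1, keepdims=True)
    logits = (v @ d.T) / temperature  # (N, N) similarity matrix

    def cross_entropy_on_diagonal(mat):
        mat = mat - mat.max(axis=1, keepdims=True)  # numerical stability
        log_prob = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the video->diagram and diagram->video directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
# Nearly identical embeddings should yield a much lower loss than random ones.
loss_matched = contrastive_alignment_loss(
    video, video + 0.01 * rng.normal(size=(4, 16)))
loss_random = contrastive_alignment_loss(video, rng.normal(size=(4, 16)))
```

The same similarity matrix also supports the dataset's first task, nearest neighbor retrieval, by taking the argmax over each row or column.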
Related papers
- Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis [31.541911711448318]
We introduce a weakly supervised framework for procedure-aware correlation learning on instructional videos.
Our framework comprises two core modules: collaborative step mining and frame-to-step alignment.
We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment.
arXiv Detail & Related papers (2023-12-18T08:57:10Z) - Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame using three machine vision tools: person detection, pose estimation, and VGG network.
The resulting time series are used to align videos of the same actions using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW).
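DDTW is this paper's variant; the underlying mechanism it builds on is classic dynamic time warping, which finds a minimum-cost monotonic alignment between two frame-feature sequences via dynamic programming. A plain DTW sketch (not the diagonalized variant) over a precomputed pairwise frame-cost matrix:

```python
import numpy as np

def dtw(cost):
    """Classic dynamic time warping over a pairwise frame-cost matrix.
    Returns the minimum accumulated cost of a monotonic alignment path
    from (0, 0) to (n-1, m-1)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in sequence A only
                acc[i, j - 1],      # advance in sequence B only
                acc[i - 1, j - 1],  # advance in both (match)
            )
    return acc[n, m]

# Two 1-D feature sequences of different lengths; b repeats a frame,
# which DTW absorbs at zero cost since the features still match.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0]])
cost = np.abs(a[:, None, 0] - b[None, :, 0])
total = dtw(cost)  # 0.0: every frame of a matches a frame of b exactly
```

The diagonalization in DDTW presumably constrains or reweights this path search, but the details are not given in the summary above.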
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Multimodal Pretraining for Dense Video Captioning [26.39052753539932]
We construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT)
We explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts.
We show that such models generalize well and are robust over a wide variety of instructional videos.
arXiv Detail & Related papers (2020-11-10T21:49:14Z) - Motion-supervised Co-Part Segmentation [88.40393225577088]
We propose a self-supervised deep learning method for co-part segmentation.
Our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts.
arXiv Detail & Related papers (2020-04-07T09:56:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.