MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
- URL: http://arxiv.org/abs/2506.12623v1
- Date: Sat, 14 Jun 2025 20:39:32 GMT
- Title: MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
- Authors: Yuan Zang, Hao Tan, Seunghyun Yoon, Franck Dernoncourt, Jiuxiang Gu, Kushal Kafle, Chen Sun, Trung Bui
- Abstract summary: We study multi-modal summarization for instructional videos, whose goal is to provide users with an efficient way to learn skills in the form of text instructions and key video frames. We propose a novel benchmark for user interface (UI) instructional video summarization to fill this gap. We collect a dataset of 2,413 UI instructional videos spanning over 167 hours.
- Score: 77.59558834294134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study multi-modal summarization for instructional videos, whose goal is to provide users with an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill this gap. We collect a dataset of 2,413 UI instructional videos spanning over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, enabling comprehensive evaluation of concise and executable video summaries. Extensive experiments on our MS4UI dataset suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, highlighting the need for new methods tailored to UI instructional videos.
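Since the benchmark couples video segmentation, per-segment text instructions, and key video frames, a record-level view helps make the annotation targets concrete. Below is a minimal sketch of what one annotated video might look like; every field name here is a hypothetical illustration, not the dataset's published schema.

```python
# Hypothetical sketch of one MS4UI-style annotation record.
# All field names and values are invented for illustration only.
from dataclasses import dataclass, field

@dataclass
class Segment:
    start_sec: float      # segment start time in the video
    end_sec: float        # segment end time
    instruction: str      # step-by-step text instruction for this segment
    key_frame_sec: float  # timestamp of the illustrative key frame

@dataclass
class AnnotatedVideo:
    video_id: str
    title: str
    duration_sec: float
    segments: list[Segment] = field(default_factory=list)

# Example record for one short UI tutorial (values invented):
video = AnnotatedVideo(
    video_id="ui_00042",
    title="Exporting a PDF",
    duration_sec=95.0,
    segments=[
        Segment(0.0, 30.5, "Open the File menu and choose Export.", 12.3),
        Segment(30.5, 95.0, "Select PDF as the format and click Save.", 60.0),
    ],
)
```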
Related papers
- SD-VSum: A Method and Dataset for Script-Driven Video Summarization [6.076406622352117]
We introduce the task of script-driven video summarization, building on the VideoXum dataset. We produce natural language descriptions of the different human-annotated summaries that are available per video. We develop a new network architecture for script-driven video summarization (SD-VSum).
arXiv Detail & Related papers (2025-05-06T08:47:14Z)
- HierSum: A Global and Local Attention Mechanism for Video Summarization [14.88934924520362]
We focus on summarizing instructional videos and propose a method for breaking down a video into meaningful segments. HierSum integrates fine-grained local cues from subtitles with global contextual information provided by video-level instructions. We show that HierSum consistently outperforms existing methods on key metrics such as F1-score and rank correlation (see the evaluation sketch after this list).
arXiv Detail & Related papers (2025-04-25T20:30:30Z)
- VideoMix: Aggregating How-To Videos for Task-Oriented Learning [36.183779096566276]
VideoMix is a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Powered by a vision-language model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips.
arXiv Detail & Related papers (2025-03-27T03:43:02Z)
- V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning [76.26890864487933]
Video summarization aims to create short, accurate, and cohesive summaries of longer videos.
Most existing datasets are created for video-to-video summarization.
Recent efforts have been made to expand from unimodal to multimodal video summarization.
arXiv Detail & Related papers (2024-04-18T17:32:46Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as supervision for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations (a contrastive-pairing sketch follows this list).
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
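HierSum above reports F1-score and rank correlation between predicted and human frame-importance scores, the standard evaluation protocol in video summarization. The following is a minimal sketch of that protocol, assuming per-frame importance scores as inputs; the score arrays are invented for illustration.

```python
# Sketch of the F1 / rank-correlation evaluation used in video summarization.
# All inputs are invented; real benchmarks provide human importance scores.
import numpy as np
from scipy.stats import kendalltau, spearmanr

def keyframe_f1(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """F1 between binary keyframe selections of equal length."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    if tp == 0:
        return 0.0
    precision = tp / pred_mask.sum()
    recall = tp / gt_mask.sum()
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-frame importance scores for a 10-frame video.
pred_scores = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.3, 0.7, 0.6, 0.2, 0.5])
gt_scores   = np.array([0.8, 0.1, 0.5, 0.9, 0.2, 0.2, 0.6, 0.7, 0.3, 0.4])

# Rank correlation between predicted and human importance rankings.
tau, _ = kendalltau(pred_scores, gt_scores)
rho, _ = spearmanr(pred_scores, gt_scores)

# F1 over the top-30% of frames selected as the summary.
k = int(0.3 * len(pred_scores))
pred_mask = pred_scores >= np.sort(pred_scores)[-k]
gt_mask = gt_scores >= np.sort(gt_scores)[-k]
print(f"tau={tau:.3f} rho={rho:.3f} F1={keyframe_f1(pred_mask, gt_mask):.3f}")
```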
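The last entry trains a model to pair each of 70M web video clips with its associated text. The summary above does not state the paper's exact objective, so the sketch below assumes a symmetric InfoNCE-style contrastive loss, a common formulation for such pairing; all tensors and dimensions are placeholders.

```python
# Sketch of a symmetric contrastive (InfoNCE-style) objective for pairing
# videos with their associated text. This formulation is an assumption;
# the summary above does not specify the paper's loss. Shapes are placeholders.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of paired clips/captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))   # matched pairs lie on the diagonal
    # Symmetric cross-entropy: video-to-text and text-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```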