TL;DW? Summarizing Instructional Videos with Task Relevance &
Cross-Modal Saliency
- URL: http://arxiv.org/abs/2208.06773v1
- Date: Sun, 14 Aug 2022 04:07:40 GMT
- Title: TL;DW? Summarizing Instructional Videos with Task Relevance &
Cross-Modal Saliency
- Authors: Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein,
Trevor Darrell, Anna Rohrbach, Cordelia Schmid
- Abstract summary: We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
- Score: 133.75876535332003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: YouTube users looking for instructions for a specific task may spend a long
time browsing content trying to find the right video that matches their needs.
Creating a visual summary (abridged version of a video) provides viewers with a
quick overview and massively reduces search time. In this work, we focus on
summarizing instructional videos, an under-explored area of video
summarization. In comparison to generic videos, instructional videos can be
parsed into semantically meaningful segments that correspond to important steps
of the demonstrated task. Existing video summarization datasets rely on manual
frame-level annotations, making them subjective and limited in size. To
overcome this, we first automatically generate pseudo summaries for a corpus of
instructional videos by exploiting two key assumptions: (i) relevant steps are
likely to appear in multiple videos of the same task (Task Relevance), and (ii)
they are more likely to be described by the demonstrator verbally (Cross-Modal
Saliency). We propose an instructional video summarization network that
combines a context-aware temporal video encoder and a segment scoring
transformer. Using pseudo summaries as weak supervision, our network constructs
a visual summary for an instructional video given only video and transcribed
speech. To evaluate our model, we collect a high-quality test set, WikiHow
Summaries, by scraping WikiHow articles that contain video demonstrations and
visual depictions of steps, allowing us to obtain the ground-truth summaries. We
outperform several baselines and a state-of-the-art video summarization model
on this new benchmark.
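
The pseudo-summary construction described in the abstract can be sketched as a simple scoring rule over ASR-aligned video segments. The following is a minimal illustrative sketch, not the authors' released code: it assumes each segment already carries visual and speech embeddings in a shared space (for example, from a pretrained video-text model), and scores segments by Task Relevance (best visual match in other videos of the same task) plus Cross-Modal Saliency (agreement between what is shown and what is said).

```python
# Illustrative sketch (not the authors' code) of pseudo-summary generation
# under the paper's two assumptions:
#   (i)  Task Relevance: an important step recurs across videos of the same task,
#   (ii) Cross-Modal Saliency: the demonstrator also describes the step verbally.
# Assumes precomputed visual and speech embeddings in a shared space.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Segment:
    visual: np.ndarray   # visual embedding of the segment
    speech: np.ndarray   # embedding of the ASR text aligned to the segment

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pseudo_summary(video: List[Segment],
                   same_task_videos: List[List[Segment]],
                   k: int = 5) -> List[int]:
    """Return the indices of the k highest-scoring segments."""
    scores = []
    for seg in video:
        # Task Relevance: best visual match in any other video of the same task.
        relevance = max((cos(seg.visual, other.visual)
                         for vid in same_task_videos for other in vid),
                        default=0.0)
        # Cross-Modal Saliency: agreement between what is shown and what is said.
        saliency = cos(seg.visual, seg.speech)
        scores.append(relevance + saliency)
    top = sorted(range(len(video)), key=scores.__getitem__, reverse=True)[:k]
    return sorted(top)  # keep selected segments in temporal order
```

In the paper these pseudo summaries serve only as weak supervision for the learned model; the exact scoring and selection thresholds may differ from this sketch.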
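
The summarization network itself is described as a context-aware temporal video encoder combined with a segment scoring transformer, trained on the pseudo summaries as weak labels. Below is a hedged PyTorch sketch of that idea; feature dimensions, layer counts, and the simple concatenation-based fusion are assumptions, not the published architecture.

```python
# Hedged sketch (not the released model): segment-level visual and speech
# features are fused, passed through a transformer encoder so each segment
# sees its temporal context, and a linear head scores summary membership.
# Trained with BCE against pseudo-summary labels.
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, visual_dim: int = 512, text_dim: int = 512,
                 d_model: int = 256, n_layers: int = 3, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(visual_dim + text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # context-aware
        self.head = nn.Linear(d_model, 1)                      # per-segment score

    def forward(self, visual, speech):
        # visual: (B, S, visual_dim), speech: (B, S, text_dim), one row per segment
        x = self.proj(torch.cat([visual, speech], dim=-1))
        x = self.encoder(x)                 # segments attend to each other
        return self.head(x).squeeze(-1)     # (B, S) summary-membership logits

# Weak supervision: labels mark segments selected by the pseudo-summary step.
model = SegmentScorer()
loss_fn = nn.BCEWithLogitsLoss()
visual = torch.randn(2, 20, 512)
speech = torch.randn(2, 20, 512)
labels = torch.randint(0, 2, (2, 20)).float()
loss = loss_fn(model(visual, speech), labels)
loss.backward()
```

At inference time, the highest-scoring segments would be concatenated in temporal order to form the visual summary.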
Related papers
- V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning [76.26890864487933]
Video summarization aims to create short, accurate, and cohesive summaries of longer videos.
Most existing datasets are created for video-to-video summarization.
Recent efforts have been made to expand from unimodal to multimodal video summarization.
arXiv Detail & Related papers (2024-04-18T17:32:46Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present Shot2Story20K, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show that generating long and comprehensive video summaries remains challenging.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Learning to Summarize Videos by Contrasting Clips [1.3999481573773074]
Video summarization aims to choose the parts of a video that narrate a story as close as possible to the original.
Most existing video summarization approaches rely on hand-crafted labels.
We propose contrastive learning to address these issues.
arXiv Detail & Related papers (2023-01-12T18:55:30Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)