Visual Subtitle Feature Enhanced Video Outline Generation
- URL: http://arxiv.org/abs/2208.11307v1
- Date: Wed, 24 Aug 2022 05:26:26 GMT
- Title: Visual Subtitle Feature Enhanced Video Outline Generation
- Authors: Qi Lv, Ziqiang Cao, Wenrui Xie, Derui Wang, Jingwen Wang, Zhiyong Hu,
Tangkun Zhang, Yuan Ba, Yuanhang Li, Min Cao, Wenjie Li, Sujian Li, Guohong
Fu
- Abstract summary: We introduce a novel video understanding task, namely video outline generation (VOG).
To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG.
We propose a Visual Subtitle feature Enhanced video outline generation model (VSENet).
- Score: 23.831220964676973
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the tremendous growth in the number of videos, there is a great demand
for techniques that help people quickly navigate to the video segments they are
interested in. However, current works on video understanding mainly focus on
video content summarization, while little effort has been made to explore the
structure of a video. Inspired by textual outline generation, we introduce a
novel video understanding task, namely video outline generation (VOG). This
task comprises two sub-tasks: (1) segmenting the video according to its
content structure and (2) generating a heading for each segment. To learn
and evaluate VOG, we annotate a 10k+ dataset, called DuVOG.
Specifically, we use OCR tools to recognize subtitles of videos. Then
annotators are asked to divide subtitles into chapters and title each chapter.
In videos, highlighted text tends to be the headline since it is more likely to
attract attention. Therefore we propose a Visual Subtitle feature Enhanced
video outline generation model (VSENet) which takes as input the textual
subtitles together with their visual font sizes and positions. We consider the
VOG task as a sequence tagging problem that extracts spans where the headings
are located and then rewrites them to form the final outlines. Furthermore,
based on the similarity between video outlines and textual outlines, we use a
large number of articles with chapter headings to pretrain our model.
Experiments on DuVOG show that our model largely outperforms other baseline
methods, achieving a 77.1 F1 score at the video segmentation level and 85.0
ROUGE-L F0.5 at the headline generation level.
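The abstract frames VOG as span tagging over OCR'd subtitles, with each token's font size and position serving as extra visual signals. Below is a minimal PyTorch sketch of that formulation; the module name, the three-feature visual input, and the BIO tag set are illustrative assumptions, not the authors' released VSENet code.

```python
# Minimal sketch of a VSENet-style tagger (not the authors' code): subtitle
# token embeddings are fused with per-token visual features (font size,
# position), encoded, and tagged with BIO labels marking heading spans.
import torch
import torch.nn as nn

class VisualSubtitleTagger(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, n_visual=3, n_tags=3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Project visual features (e.g. font size, x, y) into the model space.
        self.vis_proj = nn.Linear(n_visual, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tagger = nn.Linear(d_model, n_tags)  # B / I / O

    def forward(self, token_ids, visual_feats):
        # token_ids: (batch, seq); visual_feats: (batch, seq, n_visual)
        h = self.tok_emb(token_ids) + self.vis_proj(visual_feats)
        h = self.encoder(h)
        return self.tagger(h)  # (batch, seq, n_tags) logits

tokens = torch.randint(0, 30000, (1, 12))
visual = torch.rand(1, 12, 3)          # normalized font size, x, y
logits = VisualSubtitleTagger()(tokens, visual)
tags = logits.argmax(-1)               # predicted BIO tags per token
```

In the paper's pipeline, the tagged spans are subsequently rewritten into the final outline headings; segmentation quality is then scored with F1 and heading quality with ROUGE-L F0.5.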
Related papers
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, over 10x longer than the captions in most video-text datasets.
A model trained on Vript is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - VidChapters-7M: Video Chapters at Scale [110.19323390486775]
We present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
VidChapters-7M is created automatically and at scale by scraping user-annotated chapters from online videos.
We show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings.
arXiv Detail & Related papers (2023-09-25T08:38:11Z) - Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z) - VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip and the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - TeViS:Translating Text Synopses to Video Storyboards [30.388090248346504]
- TeViS: Translating Text Synopses to Video Storyboards [30.388090248346504]
We propose a new task called Text synopsis to Video Storyboard (TeViS).
It aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis.
The proposed VQ-Trans framework first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation.
arXiv Detail & Related papers (2022-12-31T06:32:36Z) - TL;DW? Summarizing Instructional Videos with Task Relevance &
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)