VidChapters-7M: Video Chapters at Scale
- URL: http://arxiv.org/abs/2309.13952v1
- Date: Mon, 25 Sep 2023 08:38:11 GMT
- Title: VidChapters-7M: Video Chapters at Scale
- Authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
- Abstract summary: We present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters.
We show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings.
- Score: 110.19323390486775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Segmenting long videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due
to the lack of publicly released datasets. To address this issue, we present
VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters
in total. VidChapters-7M is automatically created from videos online in a
scalable manner by scraping user-annotated chapters and hence without any
additional manual annotation. We introduce the following three tasks based on
this data. First, the video chapter generation task consists of temporally
segmenting the video and generating a chapter title for each segment. To
further dissect the problem, we also define two variants of this task: video
chapter generation given ground-truth boundaries, which requires generating a
chapter title given an annotated video segment, and video chapter grounding,
which requires temporally localizing a chapter given its annotated title. We
benchmark both simple baselines and state-of-the-art video-language models for
these three tasks. We also show that pretraining on VidChapters-7M transfers
well to dense video captioning tasks in both zero-shot and finetuning settings,
largely improving the state of the art on the YouCook2 and ViTT benchmarks.
Finally, our experiments reveal that downstream performance scales well with
the size of the pretraining dataset. Our dataset, code, and models are publicly
available at https://antoyang.github.io/vidchapters.html.
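To make the task definitions above concrete, below is a minimal sketch (in Python, with hypothetical field names; not the released VidChapters-7M schema or code) of how user-annotated chapters of the form `0:00 Intro` can be scraped from a video description and turned into (start, end, title) segments. Under this assumed representation, the (start, end) boundaries are what video chapter grounding must localize, and the titles are what chapter generation must produce.

```python
import re
from dataclasses import dataclass


@dataclass
class Chapter:
    start: float  # chapter start time in seconds
    end: float    # chapter end time (start of the next chapter, or the video's end)
    title: str    # free-form, user-written chapter title


# Matches description lines such as "0:00 Intro" or "1:02:03 - Final thoughts".
_STAMP = re.compile(r"^\s*(?:(\d+):)?(\d{1,2}):(\d{2})\s*[-:]*\s*(\S.*?)\s*$")


def parse_chapters(description: str, video_duration: float) -> list[Chapter]:
    """Turn a user-written chapter list from a video description into segments."""
    stamps: list[tuple[float, str]] = []
    for line in description.splitlines():
        m = _STAMP.match(line)
        if m:
            hours, minutes, seconds, title = m.groups()
            t = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
            stamps.append((float(t), title))
    # Heuristic validity checks: the list starts at 0:00, has at least two
    # entries, and its timestamps strictly increase.
    if len(stamps) < 2 or stamps[0][0] != 0.0:
        return []
    ends = [t for t, _ in stamps[1:]] + [video_duration]
    if any(e <= s for (s, _), e in zip(stamps, ends)):
        return []
    return [Chapter(s, e, title) for (s, title), e in zip(stamps, ends)]


if __name__ == "__main__":
    desc = "0:00 Intro\n1:25 Building the dataset\n4:10 Results"
    for ch in parse_chapters(desc, video_duration=300.0):
        print(f"[{ch.start:>6.1f}s - {ch.end:>6.1f}s] {ch.title}")
```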
Related papers
- PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation [15.9587266448337]
Video instance segmentation requires detecting, segmenting, and tracking objects in videos.
This paper introduces a method that eliminates video annotations by utilizing image datasets.
arXiv Detail & Related papers (2024-06-28T05:22:39Z)
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, over 10x longer than the captions in most video-text datasets.
A model trained on Vript is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- Towards Open-Vocabulary Video Instance Segmentation [61.469232166803465]
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories.
arXiv Detail & Related papers (2023-04-04T11:25:23Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- Multi-modal Video Chapter Generation [11.658507807110645]
We introduce a new dataset called Chapter-Gen, which consists of approximately 10k user-generated videos with annotated chapter information.
Our data collection procedure is fast, scalable and does not require any additional manual annotation.
Our experiments demonstrate that the proposed framework achieves superior results over existing methods.
arXiv Detail & Related papers (2022-09-26T13:44:48Z)
- Visual Subtitle Feature Enhanced Video Outline Generation [23.831220964676973]
We introduce a novel video understanding task, namely video outline generation (VOG).
To learn and evaluate VOG, we annotate a 10k+ video dataset called DuVOG.
We propose a Visual Subtitle feature Enhanced video outline generation model (VSENet).
arXiv Detail & Related papers (2022-08-24T05:26:26Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)