Multimodal Pretraining for Dense Video Captioning
- URL: http://arxiv.org/abs/2011.11760v1
- Date: Tue, 10 Nov 2020 21:49:14 GMT
- Title: Multimodal Pretraining for Dense Video Captioning
- Authors: Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut
- Abstract summary: We construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT).
We explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts.
We show that such models generalize well and are robust over a wide variety of instructional videos.
- Score: 26.39052753539932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning specific hands-on skills such as cooking, car maintenance, and home
repairs increasingly happens via instructional videos. The user experience with
such videos is known to be improved by meta-information such as time-stamped
annotations for the main steps involved. Generating such annotations
automatically is challenging, and we describe here two relevant contributions.
First, we construct and release a new dense video captioning dataset, Video
Timeline Tags (ViTT), featuring a variety of instructional videos together with
time-stamped annotations. Second, we explore several multimodal
sequence-to-sequence pretraining strategies that leverage large unsupervised
datasets of videos and caption-like texts. We pretrain and subsequently
finetune dense video captioning models using both YouCook2 and ViTT. We show
that such models generalize well and are robust over a wide variety of
instructional videos.
Related papers
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- Multimodal Language Models for Domain-Specific Procedural Video Summarization [0.0]
We study the use of multimodal models to enhance video summarization and step-by-step instruction generation within specific domains.
Our approach focuses on fine-tuning TimeChat to improve its performance in specific domains: cooking and medical procedures.
Our findings indicate that when finetuned on domain-specific procedural data, TimeChat can significantly improve the extraction and summarization of key instructional steps in long-format videos.
arXiv Detail & Related papers (2024-07-07T15:50:46Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging because of the need to model video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)