TeViS: Translating Text Synopses to Video Storyboards
- URL: http://arxiv.org/abs/2301.00135v4
- Date: Tue, 29 Aug 2023 13:10:56 GMT
- Title: TeViS: Translating Text Synopses to Video Storyboards
- Authors: Xu Gu, Yuchong Sun, Feiyue Ni, Shizhe Chen, Xihua Wang, Ruihua Song,
Boyuan Li, Xiang Cao
- Abstract summary: We propose a new task called Text synopsis to Video Storyboard (TeViS).
It aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis.
VQ-Trans first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation.
- Score: 30.388090248346504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A video storyboard is a roadmap for video creation which consists of
shot-by-shot images to visualize key plots in a text synopsis. Creating video
storyboards, however, remains challenging, as it not only requires cross-modal
association between high-level texts and images but also demands long-term
reasoning to make transitions smooth across shots. In this paper, we propose a
new task called Text synopsis to Video Storyboard (TeViS) which aims to
retrieve an ordered sequence of images as the video storyboard to visualize the
text synopsis. We construct a MovieNet-TeViS dataset based on the public
MovieNet dataset. It contains 10K text synopses, each paired with keyframes
manually selected from corresponding movies by considering both relevance and
cinematic coherence. To benchmark the task, we present strong CLIP-based
baselines and a novel model, VQ-Trans. VQ-Trans first encodes the text synopsis and images
into a joint embedding space and uses vector quantization (VQ) to improve the
visual representation. Then, it auto-regressively generates a sequence of
visual features for retrieval and ordering. Experimental results demonstrate
that VQ-Trans significantly outperforms prior methods and the CLIP-based
baselines. Nevertheless, there is still a large gap compared to human
performance, suggesting room for promising future work. The code and data are
available at: https://ruc-aimind.github.io/projects/TeViS/
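
Below is a minimal PyTorch sketch of the pipeline the abstract describes: text and image features in a joint space, a codebook for vector quantization, and an autoregressive Transformer that emits one quantized visual feature per shot, which is then matched against candidate keyframes for retrieval and ordering. The module names (VQTransSketch, rank_keyframes), layer sizes, nearest-neighbour quantizer, and greedy decoding loop are illustrative assumptions, not the authors' released implementation; inputs are assumed to be pre-extracted CLIP-style embeddings.

# Illustrative sketch only; not the official VQ-Trans code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQTransSketch(nn.Module):
    def __init__(self, dim=512, codebook_size=1024, max_shots=12):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)   # discrete visual vocabulary
        self.pos = nn.Embedding(max_shots + 1, dim)        # shot-position embeddings
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.bos = nn.Parameter(torch.randn(1, 1, dim))    # start-of-storyboard token

    def nearest_code(self, x):
        # Vector quantization: map each continuous feature to its closest codebook entry.
        flat = x.reshape(-1, x.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        return idx.reshape(x.shape[:-1])

    @torch.no_grad()
    def generate(self, text_emb, num_shots=5):
        # Autoregressively emit one quantized visual feature per storyboard shot,
        # conditioned on the synopsis embedding (used as decoder memory).
        memory = text_emb.unsqueeze(1)                     # (B, 1, dim)
        seq = self.bos.expand(text_emb.size(0), 1, -1)
        feats = []
        for _ in range(num_shots):
            h = self.decoder(seq + self.pos.weight[: seq.size(1)], memory)
            q = self.codebook(self.nearest_code(h[:, -1])) # (B, dim) quantized feature
            feats.append(q.unsqueeze(1))
            seq = torch.cat([seq, q.unsqueeze(1)], dim=1)
        return torch.cat(feats, dim=1)                     # (B, num_shots, dim)

def rank_keyframes(generated, candidate_img_emb):
    # Cosine similarity between each generated shot feature and every candidate image;
    # the top-ranked image per shot gives the ordered storyboard.
    g = F.normalize(generated, dim=-1)                     # (B, S, dim)
    c = F.normalize(candidate_img_emb, dim=-1)             # (N, dim)
    return (g @ c.T).argmax(-1)                            # (B, S) image indices

With CLIP text and image embeddings as inputs, rank_keyframes returns, for each shot slot, the index of the best-matching candidate keyframe, yielding an ordered storyboard.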
Related papers
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, over 10x longer than the captions in most video-text datasets.
Based on Vript, the authors train a model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
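
The frame-aggregation idea in the summary above, sketched under assumptions: frame features come from a frozen visual backbone, texts are embedded by a frozen pre-trained text encoder, and a plain Transformer encoder with mean pooling stands in for whatever aggregation the paper actually uses.

# Illustrative sketch: aggregate frame features and compare pooled video embeddings to text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameAggregator(nn.Module):
    def __init__(self, dim=512, layers=2, heads=8):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)

    def forward(self, frame_feats):                   # (B, T, dim) per-frame features
        h = self.encoder(frame_feats)                 # contextualize frames over time
        return F.normalize(h.mean(dim=1), dim=-1)     # (B, dim) video-level embedding

def video_text_similarity(video_emb, text_emb):
    # Cosine-similarity matrix usable for sequence verification / text-to-video matching.
    return video_emb @ F.normalize(text_emb, dim=-1).T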
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip and the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z)
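
One plausible way to use auto-generated captions at retrieval time, sketched as a score-level fusion; the equal weighting (alpha), the single-caption-per-video setup, and the function name retrieve_with_captions are assumptions, not necessarily the paper's design.

# Illustrative fusion of query-video and query-caption similarities for retrieval.
import torch
import torch.nn.functional as F

def retrieve_with_captions(query_emb, video_emb, generated_cap_emb, alpha=0.5):
    # generated_cap_emb: embeddings of captions produced by a zero-shot video captioner.
    q = F.normalize(query_emb, dim=-1)                     # (Q, dim) text queries
    v = F.normalize(video_emb, dim=-1)                     # (N, dim) video embeddings
    c = F.normalize(generated_cap_emb, dim=-1)             # (N, dim) caption embeddings
    score = alpha * (q @ v.T) + (1 - alpha) * (q @ c.T)    # (Q, N) fused similarity
    return score.argsort(dim=-1, descending=True)          # ranked video indices per query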
- Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners [47.59597017035785]
We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) and adapts it to video-text tasks.
VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification.
Our approach establishes a simple and effective video-text baseline for future research.
arXiv Detail & Related papers (2022-12-09T16:39:09Z)
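
A simplified view of the zero-shot transfer summarized above: run a frozen image-text tower over sampled frames and pool the per-frame embeddings into a video embedding. The mean pooling and the helper names here are simplifications of the paper's actual pooling scheme.

# Illustrative sketch: adapt a frozen image-text encoder to video by pooling frame embeddings.
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_embedding(image_encoder, frames):
    # frames: (T, 3, H, W) sampled from one video; image_encoder: any pretrained
    # image tower returning (T, dim) embeddings (e.g. a CoCa/CLIP-style model).
    feats = image_encoder(frames)                      # (T, dim)
    return F.normalize(feats.mean(dim=0), dim=-1)      # (dim,) video-level embedding

@torch.no_grad()
def zero_shot_classify(video_emb, class_text_embs):
    # Zero-shot video classification: pick the class whose text embedding is closest.
    sims = F.normalize(class_text_embs, dim=-1) @ video_emb
    return sims.argmax().item()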
- Character-Centric Story Visualization via Visual Planning and Token Alignment [53.44760407148918]
Story visualization advances traditional text-to-image generation by generating multiple images based on a complete story.
A key challenge of consistent story visualization is preserving the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
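
A rough sketch of the fine-grained idea above: score frames against the paired text, keep the most relevant ones, and apply a standard InfoNCE-style loss. The top-k selection, temperature, and one-sided loss are illustrative choices rather than FineCo's actual objective.

# Illustrative frame-selection contrastive objective (not FineCo's exact formulation).
import torch
import torch.nn.functional as F

def select_and_contrast(frame_feats, text_emb, k=4, temperature=0.05):
    # frame_feats: (B, T, dim) with T >= k; text_emb: (B, dim), paired by batch index.
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    frame_scores = torch.einsum('btd,bd->bt', f, t)        # frame-to-text relevance
    top = frame_scores.topk(k, dim=1).indices              # keep the k most relevant frames
    idx = top.unsqueeze(-1).expand(-1, -1, f.size(-1))
    video = F.normalize(torch.gather(f, 1, idx).mean(dim=1), dim=-1)   # (B, dim)
    logits = video @ t.T / temperature                     # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                 # symmetric term omitted for brevity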
- Visual Subtitle Feature Enhanced Video Outline Generation [23.831220964676973]
We introduce a novel video understanding task, namely video outline generation (VOG).
To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG.
We propose a Visual Subtitle feature Enhanced video outline generation model (VSENet).
arXiv Detail & Related papers (2022-08-24T05:26:26Z)
- X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [26.581384985173116]
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video.
arXiv Detail & Related papers (2022-03-28T20:47:37Z)
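
A minimal sketch of text-conditioned pooling in the spirit of the cross-modal attention described above: the text embedding acts as the attention query over per-frame keys and values. The single-head attention, projection sizes, and the class name TextConditionedPool are simplifications, not X-Pool's actual architecture.

# Illustrative text-conditioned pooling over video frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedPool(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the text embedding to a query
        self.k = nn.Linear(dim, dim)   # projects frame embeddings to keys
        self.v = nn.Linear(dim, dim)   # projects frame embeddings to values
        self.scale = dim ** -0.5

    def forward(self, text_emb, frame_feats):
        # text_emb: (B, dim); frame_feats: (B, T, dim)
        attn = torch.softmax(
            torch.einsum('bd,btd->bt', self.q(text_emb), self.k(frame_feats)) * self.scale,
            dim=-1)
        pooled = torch.einsum('bt,btd->bd', attn, self.v(frame_feats))  # text-specific video emb
        return F.cosine_similarity(text_emb, pooled, dim=-1)            # (B,) matching score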