Narration Generation for Cartoon Videos
- URL: http://arxiv.org/abs/2101.06803v1
- Date: Sun, 17 Jan 2021 23:23:09 GMT
- Title: Narration Generation for Cartoon Videos
- Authors: Nikos Papasarantopoulos, Shay B. Cohen
- Abstract summary: We propose a new task, narration generation: complementing videos with narration texts to be interjected at several points.
We collect a new dataset from the animated television series Peppa Pig.
- Score: 35.814965300322015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on text generation from multimodal inputs has largely
focused on static images, and less on video data. In this paper, we propose a
new task, narration generation: complementing videos with narration texts to
be interjected at several points. The narrations are part of the video and
contribute to the storyline unfolding in it. Moreover, they are
context-informed: they include information appropriate for the timeframe of
the video they cover, and they do not need to mention every detail shown in
the input scenes, as a caption would. We collect a new dataset from the
animated television series Peppa Pig. Furthermore, we formalize narration
generation as comprising two separate subtasks, timing and content generation,
and present a set of models for the new task.
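To make the two-part formalization concrete, below is a minimal sketch of the timing/content decomposition, assuming a simple segment-level view of an episode. All names here (Segment, predict_narration_slots, generate_narration, narrate) are hypothetical illustrations rather than the authors' models, and the timing heuristic stands in for a learned boundary predictor.

```python
# Sketch of the two-subtask decomposition described in the abstract:
# (1) timing: decide after which segments a narration should be interjected,
# (2) content: generate context-informed narration text for each chosen slot.
# Hypothetical illustration only; not the paper's actual models or code.

from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A contiguous stretch of the episode (e.g., a scene with its dialogue)."""
    dialogue: str          # transcript of the characters' dialogue
    visual_summary: str    # placeholder for visual features / descriptions


def predict_narration_slots(segments: List[Segment]) -> List[int]:
    """Timing subtask: indices of segments after which to insert narration.

    Trivial stand-in heuristic (narrate after every other segment);
    a learned model would score each boundary instead.
    """
    return [i for i in range(len(segments)) if i % 2 == 1]


def generate_narration(context: List[Segment]) -> str:
    """Content subtask: narration for the timeframe covered so far.

    Stand-in for a conditional text generator that conditions on the
    preceding segments rather than describing every visible detail.
    """
    return f"Narrator: {context[-1].visual_summary}"


def narrate(segments: List[Segment]) -> List[str]:
    """Compose the two subtasks into a full narration pass over an episode."""
    slots = set(predict_narration_slots(segments))
    output = []
    for i, seg in enumerate(segments):
        output.append(seg.dialogue)
        if i in slots:
            output.append(generate_narration(segments[: i + 1]))
    return output


if __name__ == "__main__":
    episode = [
        Segment("Peppa: I love jumping in muddy puddles!", "Peppa jumps in a puddle."),
        Segment("George: Dinosaur! Grrr!", "George waves his toy dinosaur."),
    ]
    for line in narrate(episode):
        print(line)
```

In practice both components would be learned models conditioned on the video and dialogue context, but the split into two calls mirrors the timing/content task decomposition described in the abstract.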
Related papers
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show some challenges to generate a long and comprehensive video summary.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization [42.439670922813434]
We introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate text-to-video models.
Our benchmark includes three video generation tasks of increasing difficulty: action execution, story continuation, and story generation.
We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions.
arXiv Detail & Related papers (2023-08-22T17:53:55Z) - VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z) - Connecting Vision and Language with Video Localized Narratives [54.094554472715245]
We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language.
In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment.
Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects.
arXiv Detail & Related papers (2023-02-22T09:04:00Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)