Text Synopsis Generation for Egocentric Videos
- URL: http://arxiv.org/abs/2005.03804v2
- Date: Mon, 21 Sep 2020 16:29:38 GMT
- Title: Text Synopsis Generation for Egocentric Videos
- Authors: Aidean Sharghi, Niels da Vitoria Lobo, Mubarak Shah
- Abstract summary: We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video.
Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database.
- Score: 72.52130695707008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mass utilization of body-worn cameras has led to a huge corpus of available
egocentric video. Existing video summarization algorithms can accelerate
browsing of such videos by selecting (visually) interesting shots from them.
Nonetheless, since the system user still has to watch the summary videos,
browsing large video databases remains a challenge. Hence, in this work, we
propose to generate a textual synopsis, consisting of a few sentences
describing the most important events in a long egocentric video. Users can
read the short text to gain insight about the video, and more importantly,
efficiently search through the content of a large video database using text
queries. Since egocentric videos are long and contain many activities and
events, using video-to-text algorithms results in thousands of descriptions,
many of which are incorrect. Therefore, we propose a multi-task learning scheme
to simultaneously generate descriptions for video segments and summarize the
resulting descriptions in an end-to-end fashion. We input a set of video shots,
and the network generates a text description for each shot. Next, a
visual-language content matching unit, trained with a weakly supervised
objective, identifies the correct descriptions. Finally, the last component of
our network, called the purport network, evaluates all the descriptions jointly to
select the ones containing crucial information. Out of thousands of
descriptions generated for the video, a few informative sentences are returned
to the user. We validate our framework on the challenging UT Egocentric video
dataset, where each video is 3 to 5 hours long and is associated with over
3,000 textual descriptions on average. The generated textual summaries, which
contain only 5 percent (or less) of the generated descriptions, are compared
to ground-truth summaries in the text domain using well-established metrics
from natural language processing.
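To make the pipeline described above more concrete, the following is a minimal PyTorch-style sketch of how such a three-stage design could be wired together: a per-shot captioner, a weakly supervised visual-language content matcher, and a purport module that scores all candidate descriptions jointly before only the top few are kept. All class names, layer choices, the ranking loss, and the additive score fusion are illustrative assumptions for exposition, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class ShotCaptioner(nn.Module):
    """Stand-in for the per-shot video-to-text component (encoder-decoder)."""
    def __init__(self, feat_dim=512, vocab_size=10000, hidden=512, max_len=20):
        super().__init__()
        self.max_len = max_len
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, shot_feats):              # shot_feats: (B, T, feat_dim) frame features
        _, h = self.encoder(shot_feats)         # (1, B, hidden) shot summary
        dec_in = h.transpose(0, 1).repeat(1, self.max_len, 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)                # (B, max_len, vocab_size) token logits


class ContentMatcher(nn.Module):
    """Scores how well each generated caption matches its source shot."""
    def __init__(self, feat_dim=512, txt_dim=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim + txt_dim, 256),
                                   nn.ReLU(), nn.Linear(256, 1))

    def forward(self, shot_emb, caption_emb):   # (N, feat_dim), (N, txt_dim)
        return self.score(torch.cat([shot_emb, caption_emb], dim=-1)).squeeze(-1)


def weak_matching_loss(pos_scores, neg_scores, margin=0.5):
    """Illustrative weakly supervised ranking objective: a shot should score
    higher with its own caption than with a caption taken from another shot."""
    return torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()


class PurportNetwork(nn.Module):
    """Scores the importance of all candidate captions of a video jointly."""
    def __init__(self, txt_dim=512, nhead=4):
        super().__init__()
        self.ctx = nn.TransformerEncoderLayer(txt_dim, nhead, batch_first=True)
        self.score = nn.Linear(txt_dim, 1)

    def forward(self, caption_embs):            # (1, N, txt_dim): all captions of one video
        ctx = self.ctx(caption_embs)
        return self.score(ctx).squeeze(-1).squeeze(0)   # (N,) importance per caption


def select_synopsis(match_scores, purport_scores, captions, keep_frac=0.05):
    """Fuse correctness and importance scores (both 1-D, one per caption)
    and keep only the top few percent of the generated descriptions."""
    combined = match_scores + purport_scores    # additive fusion is a placeholder assumption
    k = max(1, int(keep_frac * len(captions)))
    top = torch.topk(combined, k).indices
    return [captions[i] for i in top.tolist()]
```

Here `select_synopsis` mirrors the stated outcome of keeping roughly 5 percent of the generated descriptions; the paper trains these stages jointly in a multi-task, end-to-end fashion rather than as the independent modules shown in this sketch.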
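For the text-domain evaluation mentioned above, one common choice of "well-established metric" for summarization is ROUGE; the snippet below shows one way such a comparison could be run with the `rouge-score` package (pip install rouge-score). The specific metric suite and helper names here are assumptions, not necessarily what the paper reports.

```python
# Evaluation sketch: compare a generated synopsis against a ground-truth
# text summary using ROUGE F1 scores.
from rouge_score import rouge_scorer


def evaluate_synopsis(generated_sentences, reference_summary):
    """Return ROUGE F1 scores of a generated synopsis against a reference text."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    hypothesis = " ".join(generated_sentences)
    scores = scorer.score(reference_summary, hypothesis)
    return {name: s.fmeasure for name, s in scores.items()}


# Toy example (made-up sentences, not from the dataset):
print(evaluate_synopsis(
    ["The camera wearer shops at a grocery store.",
     "She then drives home and cooks dinner."],
    "The wearer goes grocery shopping, then returns home to cook dinner."))
```

Metrics such as BLEU, METEOR, or CIDEr could be substituted behind the same interface.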
Related papers
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show that generating a long and comprehensive video summary remains challenging.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval [43.58794386905177]
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
arXiv Detail & Related papers (2023-11-30T18:59:45Z)
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization [42.439670922813434]
We introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate text-to-video models.
Our benchmark includes three video generation tasks of increasing difficulty: action execution, story continuation, and story generation.
We evaluate small yet strong text-to-video baselines and show the benefits of training on story-like data algorithmically generated from existing video captions.
arXiv Detail & Related papers (2023-08-22T17:53:55Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip and the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review [1.0520692160489133]
This review categorizes and describes the state-of-the-art techniques for the video-to-text problem.
It covers the main video-to-text methods and the ways to evaluate their performance.
State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
arXiv Detail & Related papers (2021-03-27T02:12:28Z)