Implicit and Explicit Commonsense for Multi-sentence Video Captioning
- URL: http://arxiv.org/abs/2303.07545v2
- Date: Tue, 9 Jan 2024 01:30:38 GMT
- Title: Implicit and Explicit Commonsense for Multi-sentence Video Captioning
- Authors: Shih-Han Chou, James J. Little, Leonid Sigal
- Abstract summary: We propose a novel video captioning Transformer-based model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge.
We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions.
- Score: 33.969215964292395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing dense or paragraph video captioning approaches rely on holistic
representations of videos, possibly coupled with learned object/action
representations, to condition hierarchical language decoders. However, they
fundamentally lack the commonsense knowledge of the world required to reason
about progression of events, causality, and even the function of certain
objects within a scene. To address this limitation, we propose a novel
Transformer-based video captioning model that takes into account both implicit
(visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense
knowledge. We show that these forms of knowledge, in isolation and in
combination, enhance the quality of produced captions. Further, inspired by
imitation learning, we propose a new task of instruction generation, where the
goal is to produce a set of linguistic instructions from a video demonstration
of its performance. We formalize the task using the ALFRED dataset [54]
generated using an AI2-THOR environment. While instruction generation is
conceptually similar to paragraph captioning, it differs in that it
exhibits stronger object persistence, as well as spatially-aware and causal
sentence structure. We show that our commonsense knowledge enhanced approach
produces significant improvements on this task (up to 57% in METEOR and 8.5% in
CIDEr), as well as the state-of-the-art result on more traditional video
captioning in the ActivityNet Captions dataset [29].
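The abstract describes the approach only at a high level. As a rough illustration of the general idea (not the authors' implementation), the sketch below shows one way implicit and explicit commonsense embeddings could be fused with video features as the memory of a Transformer caption decoder in PyTorch. All module names, dimensions, and the fusion-by-concatenation choice are assumptions made for this example; positional encodings and training code are omitted.

```python
# Hypothetical sketch (not the paper's code): condition a Transformer caption
# decoder on video features plus implicit and explicit commonsense embeddings.
import torch
import torch.nn as nn


class CommonsenseCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4,
                 video_dim=2048, implicit_dim=768, kb_dim=300):
        super().__init__()
        # Project each knowledge source into a shared d_model space.
        self.video_proj = nn.Linear(video_dim, d_model)      # clip/frame features
        self.implicit_proj = nn.Linear(implicit_dim, d_model)  # e.g. visuo-lingual / LM embeddings
        self.explicit_proj = nn.Linear(kb_dim, d_model)       # e.g. retrieved knowledge-base facts
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, implicit_feats, explicit_feats, caption_ids):
        # Concatenate all sources into one memory sequence the decoder attends to.
        memory = torch.cat([
            self.video_proj(video_feats),
            self.implicit_proj(implicit_feats),
            self.explicit_proj(explicit_feats),
        ], dim=1)                                  # (B, Tv+Ti+Te, d_model)
        tgt = self.token_emb(caption_ids)          # (B, L, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                   # next-token logits


# Toy usage with random tensors standing in for real features.
model = CommonsenseCaptionDecoder(vocab_size=10000)
logits = model(torch.randn(2, 32, 2048),              # 32 video segments
               torch.randn(2, 8, 768),                # 8 implicit commonsense vectors
               torch.randn(2, 16, 300),               # 16 KB fact embeddings
               torch.randint(0, 10000, (2, 20)))      # caption token ids
print(logits.shape)  # torch.Size([2, 20, 10000])
```

Concatenating the projected sources into a single memory sequence is just one plausible fusion strategy; the paper evaluates the knowledge sources both in isolation and in combination, which this sketch does not reproduce.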
Related papers
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts transcripts by attending to learned video representations.
These advantages enable our model to contextualize what is happening, much as humans do, and to apply seamlessly to large-scale uncurated real-world video data.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
- Goal-driven text descriptions for images [7.059848512713061]
This thesis focuses on generating textual output given visual input.
We use a comprehension machine to guide the generated referring expressions to be more discriminative.
In Chapter 5, we study how training objectives and sampling methods affect the models' ability to generate diverse captions.
arXiv Detail & Related papers (2021-08-28T05:10:38Z)
- Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.