Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
- URL: http://arxiv.org/abs/2003.05162v3
- Date: Fri, 16 Oct 2020 02:08:26 GMT
- Title: Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
- Authors: Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
- Abstract summary: In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene.
Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent.
We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes.
- Score: 56.97139024595429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Captioning is a crucial and challenging task for video understanding. In
videos that involve active agents such as humans, the agent's actions can bring
about myriad changes in the scene. Observable changes such as movements,
manipulations, and transformations of the objects in the scene, are reflected
in conventional video captioning. Unlike images, actions in videos are also
inherently linked to social aspects such as intentions (why the action is
taking place), effects (what changes due to the action), and attributes that
describe the agent. Thus for video understanding, such as when captioning
videos or when answering questions about videos, one must have an understanding
of these commonsense aspects. We present the first work on generating
commonsense captions directly from videos, to describe latent aspects such as
intentions, effects, and attributes. We present a new dataset
"Video-to-Commonsense (V2C)" that contains $\sim9k$ videos of human agents
performing various actions, annotated with 3 types of commonsense descriptions.
Additionally we explore the use of open-ended video-based commonsense question
answering (V2C-QA) as a way to enrich our captions. Both the generation task
and the QA task can be used to enrich video captions.
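To make the dataset and task structure concrete, here is a minimal sketch of what a single V2C annotation and its derived V2C-QA pairs could look like. The field names, question templates, and example content are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch of a hypothetical V2C-style record; field names and question
# templates are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class V2CRecord:
    """A video annotated with a factual caption plus the three commonsense
    description types named in the abstract (intention, effect, attribute)."""
    video_id: str
    caption: str                                          # observable description
    intentions: List[str] = field(default_factory=list)   # why the action happens
    effects: List[str] = field(default_factory=list)      # what changes as a result
    attributes: List[str] = field(default_factory=list)   # properties of the agent

    def to_qa_pairs(self) -> List[Dict[str, str]]:
        """Derive open-ended QA pairs from the commonsense annotations,
        mirroring the V2C-QA idea of enriching captions via questions."""
        templates = {
            "intentions": "Why is the agent performing this action?",
            "effects": "What happens as a result of this action?",
            "attributes": "What kind of agent is performing this action?",
        }
        pairs = []
        for kind, question in templates.items():
            for answer in getattr(self, kind):
                pairs.append({
                    "video_id": self.video_id,
                    "question": question,
                    "answer": answer,
                })
        return pairs


# Example usage with made-up content.
record = V2CRecord(
    video_id="v_000123",
    caption="A man is lifting weights in a gym.",
    intentions=["he wants to get stronger"],
    effects=["he becomes tired and sweaty"],
    attributes=["he is athletic and disciplined"],
)
print(record.to_qa_pairs())
```

Grouping the three description types under one record reflects how the abstract frames them: complementary latent aspects of the same observed action, usable either for caption generation or for open-ended QA.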
Related papers
- Storyboard guided Alignment for Fine-grained Video Action Recognition [32.02631248389487]
Fine-grained video action recognition can be conceptualized as a video-text matching problem.
We propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics.
arXiv Detail & Related papers (2024-10-18T07:40:41Z)
- RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates versatile capabilities in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show that generating a long and comprehensive video summary remains challenging.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also offers excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
- Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols [53.706461356853445]
Untrimmed videos have interrelated events, dependencies, context, overlapping events, object-object interactions, domain specificity, and other semantics worth describing.
Dense Video Captioning (DVC) aims at detecting and describing the different events in a given video.
arXiv Detail & Related papers (2023-11-05T01:45:31Z)
- FunQA: Towards Surprising Video Comprehension [64.58663825184958]
We introduce FunQA, a challenging video question-answering dataset.
FunQA covers three previously unexplored types of surprising videos: HumorQA, CreativeQA, and MagicQA.
In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips.
arXiv Detail & Related papers (2023-06-26T17:59:55Z)
- Implicit and Explicit Commonsense for Multi-sentence Video Captioning [33.969215964292395]
We propose a novel video captioning Transformer-based model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge.
We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions.
arXiv Detail & Related papers (2023-03-14T00:19:11Z)
- Saying the Unseen: Video Descriptions via Dialog Agents [37.16726118481626]
We introduce a novel task that aims to describe a video using the natural language dialog between two agents.
Q-BOT is given two semantically segmented frames, one from the beginning and one from the end of the video.
A-BOT, the other agent, which has access to the entire video, assists Q-BOT in accomplishing the goal by answering its questions.
arXiv Detail & Related papers (2021-06-26T17:36:31Z)
- iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering [0.0]
We propose iPerceive, a framework capable of understanding the "why" between events in a video.
We demonstrate the effectiveness of iPerceive on dense video captioning and video question answering, both formulated as machine translation problems.
Our approach furthers the state-of-the-art in visual understanding.
arXiv Detail & Related papers (2020-11-16T05:44:45Z)
- Improved Actor Relation Graph based Group Activity Recognition [0.0]
Detailed descriptions of human actions and group activities are essential information that can be used in real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method that focuses mainly on group activity recognition by learning pair-wise actor appearance similarity and actor positions.
arXiv Detail & Related papers (2020-10-24T19:46:49Z)