iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video
Captioning and Video Question Answering
- URL: http://arxiv.org/abs/2011.07735v1
- Date: Mon, 16 Nov 2020 05:44:45 GMT
- Title: iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video
Captioning and Video Question Answering
- Authors: Aman Chadha, Gurneet Arora, Navpreet Kaloty
- Abstract summary: We propose iPerceive, a framework capable of understanding the "why" between events in a video.
We demonstrate its effectiveness on dense video captioning (DVC) and video question answering (VideoQA), formulating both tasks as machine translation problems that utilize multiple modalities.
Our approach furthers the state-of-the-art in visual understanding.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most prior art in visual understanding relies solely on analyzing the "what"
(e.g., event recognition) and "where" (e.g., event localization), which in some
cases, fails to describe correct contextual relationships between events or
leads to incorrect underlying visual attention. Part of what defines us as
human and fundamentally different from machines is our instinct to seek
causality behind any association, say an event Y that happened as a direct
result of event X. To this end, we propose iPerceive, a framework capable of
understanding the "why" between events in a video by building a common-sense
knowledge base using contextual cues to infer causal relationships between
objects in the video. We demonstrate the effectiveness of our technique using
the dense video captioning (DVC) and video question answering (VideoQA) tasks.
Furthermore, while most prior work in DVC and VideoQA relies solely on visual
information, other modalities such as audio and speech are vital for a human
observer's perception of an environment. We formulate DVC and VideoQA tasks as
machine translation problems that utilize multiple modalities. By evaluating
the performance of iPerceive DVC and iPerceive VideoQA on the ActivityNet
Captions and TVQA datasets respectively, we show that our approach furthers the
state-of-the-art. Code and samples are available at: iperceive.amanchadha.com.
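To make the abstract's framing concrete, below is a minimal sketch (not the authors' released code) of how DVC can be cast as multi-modal "machine translation": per-event visual, audio, and speech features are projected into a shared space and fed to a standard transformer encoder-decoder that generates caption tokens. All module names, feature dimensions, and the simple concatenation-based fusion are illustrative assumptions, not iPerceive's actual architecture.

```python
# Hypothetical sketch: dense video captioning as translation from fused
# multi-modal event features to caption tokens. Dimensions are made up.
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    def __init__(self, d_visual=1024, d_audio=128, d_speech=768,
                 d_model=512, vocab_size=10000, max_len=64):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.proj_visual = nn.Linear(d_visual, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_speech = nn.Linear(d_speech, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        # Standard encoder-decoder transformer, as in machine translation.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual, audio, speech, captions):
        # visual: (B, Tv, d_visual), audio: (B, Ta, d_audio),
        # speech: (B, Ts, d_speech), captions: (B, L) token ids.
        src = torch.cat([self.proj_visual(visual),
                         self.proj_audio(audio),
                         self.proj_speech(speech)], dim=1)
        length = captions.size(1)
        positions = torch.arange(length, device=captions.device)
        tgt = self.token_embed(captions) + self.pos_embed(positions)
        # Causal mask so each caption token attends only to earlier tokens.
        tgt_mask = torch.triu(
            torch.full((length, length), float("-inf"),
                       device=captions.device), diagonal=1)
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)  # (B, L, vocab_size) logits


if __name__ == "__main__":
    model = MultiModalCaptioner()
    logits = model(torch.randn(2, 30, 1024),            # visual features
                   torch.randn(2, 20, 128),             # audio features
                   torch.randn(2, 10, 768),             # speech (ASR) features
                   torch.randint(0, 10000, (2, 16)))    # caption tokens
    print(logits.shape)  # torch.Size([2, 16, 10000])
```

In this framing, the fused multi-modal features play the role of the source sentence and the caption plays the role of the target sentence; training would proceed with the usual teacher-forced cross-entropy objective.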
Related papers
- SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model [35.60147467774199]
To the best of the authors' knowledge, SAV-SE is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which ultimately improves speech enhancement performance.
arXiv Detail & Related papers (2024-11-12T12:23:41Z) - EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval model achieves strong video-text retrieval through superior video event awareness.
EA-VTR efficiently encodes frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of both detailed event content and complex event temporal structure.
arXiv Detail & Related papers (2024-07-10T09:09:58Z) - A Survey of Video Datasets for Grounded Event Understanding [34.11140286628736]
Multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding.
We survey 105 video datasets that require event understanding capability.
arXiv Detail & Related papers (2024-06-14T00:36:55Z) - Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation
Protocols [53.706461356853445]
Untrimmed videos have interrelated events, dependencies, context, overlapping events, object-object interactions, domain specificity, and other semantics worth describing.
Dense Video Captioning (DVC) aims at detecting and describing different events in a given video.
arXiv Detail & Related papers (2023-11-05T01:45:31Z) - Visual Causal Scene Refinement for Video Question Answering [117.08431221482638]
We present a causal analysis of VideoQA and propose a framework for cross-modal causal reasoning, named Visual Causal Scene Refinement (VCSR).
VCSR involves two essential modules, one of which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention.
Experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of VCSR in discovering visual causal scenes and achieving robust video question answering.
arXiv Detail & Related papers (2023-05-07T09:05:19Z) - A Review of Deep Learning for Video Captioning [111.1557921247882]
Video captioning (VC) is a fast-moving, cross-disciplinary area of research.
This survey covers deep learning-based VC, including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, and dense video captioning (DVC).
arXiv Detail & Related papers (2023-04-22T15:30:54Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z) - Video2Commonsense: Generating Commonsense Descriptions to Enrich Video
Captioning [56.97139024595429]
In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene.
Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent.
We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes.
arXiv Detail & Related papers (2020-03-11T08:42:57Z)