Cross-Modal Reasoning with Event Correlation for Video Question
Answering
- URL: http://arxiv.org/abs/2312.12721v1
- Date: Wed, 20 Dec 2023 02:30:39 GMT
- Title: Cross-Modal Reasoning with Event Correlation for Video Question
Answering
- Authors: Chengxiang Yin, Zhengping Che, Kun Wu, Zhiyuan Xu, Qinru Qiu, Jian
Tang
- Abstract summary: We introduce the dense caption modality as a new auxiliary and distill event-correlated information from it to infer the correct answer.
We employ cross-modal reasoning modules for explicitly modeling inter-modal relationships and aggregating relevant information across different modalities.
We propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning.
- Score: 32.332251488360185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Question Answering (VideoQA) is an attractive yet challenging
research direction that aims to understand the complex semantics of heterogeneous
data from two domains, i.e., the spatio-temporal video content and the word
sequence of the question. Although various attention mechanisms have been utilized
to build
contextualized representations by modeling intra- and inter-modal relationships
of the two modalities, one limitation of the predominant VideoQA methods is the
lack of reasoning with event correlation, that is, sensing and analyzing
relationships among abundant and informative events contained in the video. In
this paper, we introduce the dense caption modality as a new auxiliary and
distill event-correlated information from it to infer the correct answer. To
this end, we propose a novel end-to-end trainable model, Event-Correlated Graph
Neural Networks (EC-GNNs), to perform cross-modal reasoning over information
from the three modalities (i.e., caption, video, and question). Beyond
exploiting this new modality, we employ cross-modal reasoning modules
for explicitly modeling inter-modal relationships and aggregating relevant
information across different modalities, and we propose a question-guided
self-adaptive multi-modal fusion module to collect the question-oriented and
event-correlated evidence through multi-step reasoning. We evaluate our model
on two widely-used benchmark datasets and conduct an ablation study to justify
the effectiveness of each proposed component.
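To make the described design more concrete, the following is a minimal, illustrative sketch of the two ideas named in the abstract: a cross-modal attention module that lets one modality (e.g., captions) aggregate relevant information from another (e.g., video), and a question-guided self-adaptive fusion step that weights caption, video, and question evidence before answer prediction. All class names, tensor shapes, and the single-step fusion are assumptions made for illustration; this is not the authors' EC-GNN implementation.

```python
# Minimal sketch with assumed names and shapes, not the authors' EC-GNN code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """One cross-modal reasoning step: the target modality attends to a source modality."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, N_t, D) features of one modality; source: (B, N_s, D) of another.
        q, k, v = self.q_proj(target), self.k_proj(source), self.v_proj(source)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return target + attn @ v  # residual update with cross-modal evidence


class QuestionGuidedFusion(nn.Module):
    """Question-guided self-adaptive fusion of per-modality summary vectors."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, question: torch.Tensor, summaries: list[torch.Tensor]) -> torch.Tensor:
        # question: (B, D); summaries: one (B, D) vector per modality.
        scores = torch.cat(
            [self.scorer(torch.cat([question, s], dim=-1)) for s in summaries], dim=-1
        )                                             # (B, num_modalities)
        weights = F.softmax(scores, dim=-1)           # question-conditioned modality weights
        stacked = torch.stack(summaries, dim=1)       # (B, num_modalities, D)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # fused evidence, (B, D)
```

Since the abstract describes collecting evidence through multi-step reasoning, a full model would presumably iterate this kind of fusion rather than apply it once; the sketch keeps a single step only to make the data flow easy to follow.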
Related papers
- Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: a Multi-granularity Contrastive cross-modal collaborative Generation model.
arXiv Detail & Related papers (2024-10-12T06:21:58Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task include the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations are cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)