Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering
- URL: http://arxiv.org/abs/2010.10019v2
- Date: Sun, 3 Jan 2021 07:11:23 GMT
- Title: Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering
- Authors: Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran
- Abstract summary: Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query.
Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates them into a new set of objects that encode relations of the inputs.
The resulting HCRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information, such as subtitles, is presented.
- Score: 67.85579756590478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video QA challenges modelers on multiple fronts. Modeling video necessitates
building not only spatio-temporal models for the dynamic visual channel but
also multimodal structures for associated information channels such as
subtitles or audio. Video QA adds at least two more layers of complexity -
selecting relevant content for each channel in the context of the linguistic
query, and composing spatio-temporal concepts and relations in response to the
query. To address these requirements, we start with two insights: (a) content
selection and relation construction can be jointly encapsulated into a
conditional computational structure, and (b) video-length structures can be
composed hierarchically. For (a), this paper introduces a general-purpose, reusable
neural unit dubbed the Conditional Relation Network (CRN), which takes as input a set of
tensorial objects and translates them into a new set of objects that encode
relations of the inputs. The generic design of CRN eases the typically
complex model-building process of Video QA: models are assembled by simple block stacking,
with the flexibility to accommodate input modalities and conditioning features
across different domains. As a result, we realize insight (b) by introducing
Hierarchical Conditional Relation Networks (HCRN) for Video QA. The HCRN
primarily aims at exploiting intrinsic properties of the visual content of a
video and its accompanying channels in terms of compositionality, hierarchy,
and near-term and far-term relations. HCRN is then applied to Video QA in two forms:
short-form, where answers are reasoned solely from the visual content, and
long-form, where associated information, such as subtitles, is presented. Our
rigorous evaluations show consistent improvements over state-of-the-art methods on well-studied
benchmarks, including large-scale real-world datasets such as TGIF-QA and TVQA,
demonstrating the strong capabilities of our CRN unit and the HCRN for complex
domains such as Video QA.
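To make the CRN idea concrete, the sketch below shows one way such a unit could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' released implementation: the class name ConditionalRelationUnit, the mean-pooling over subsets, the MLP fusion g, and the max_subset_size parameter are hypothetical stand-ins for the paper's conditional relation operation over a set of input objects and a conditioning feature (e.g., the encoded question).

```python
# Illustrative sketch only; design choices here are assumptions, not the paper's code.
import itertools
import torch
import torch.nn as nn


class ConditionalRelationUnit(nn.Module):
    """CRN-style unit: map a set of tensorial objects plus a conditioning
    feature to a new set of objects encoding subset relations."""

    def __init__(self, dim: int, max_subset_size: int = 3):
        super().__init__()
        self.max_subset_size = max_subset_size
        # g fuses an aggregated subset with the conditioning feature (hypothetical choice).
        self.g = nn.Sequential(nn.Linear(2 * dim, dim), nn.ELU(), nn.Linear(dim, dim))

    def forward(self, objects, condition):
        n = len(objects)
        outputs = []
        # Relations over subsets of size 2 .. min(max_subset_size, n - 1),
        # a simplified stand-in for the subset sampling implied by the abstract.
        for k in range(2, min(self.max_subset_size, n - 1) + 1):
            for subset in itertools.combinations(objects, k):
                pooled = torch.stack(subset, dim=0).mean(dim=0)  # aggregate the subset
                fused = torch.cat([pooled, condition], dim=-1)   # condition on the query/context
                outputs.append(self.g(fused))                    # relation-encoding output object
        return outputs


# Usage sketch: four clip-level features conditioned on a question embedding.
dim = 512
clips = [torch.randn(dim) for _ in range(4)]
question = torch.randn(dim)
crn = ConditionalRelationUnit(dim)
new_objects = crn(clips, question)  # a new set of relation-encoding objects
```

Stacking such units hierarchically, e.g. clip-level units whose output objects become the input set of a video-level unit conditioned on the query, illustrates the block-stacking composition that insight (b) and HCRN describe.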
Related papers
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z)
- Dense but Efficient VideoQA for Intricate Compositional Reasoning [9.514382838449928]
We suggest a new VideoQA method based on a transformer with a deformable attention mechanism to address these complex tasks.
The dependency structure within the complex question sentences is also combined with the language embeddings to readily understand the semantic relations among question words.
arXiv Detail & Related papers (2022-10-19T05:01:20Z)
- Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text.
arXiv Detail & Related papers (2022-06-02T12:25:52Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called the Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)