Detection-based Intermediate Supervision for Visual Question Answering
- URL: http://arxiv.org/abs/2312.16012v1
- Date: Tue, 26 Dec 2023 11:45:22 GMT
- Title: Detection-based Intermediate Supervision for Visual Question Answering
- Authors: Yuhang Liu, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, Dangyang Chen
- Abstract summary: We propose a generative detection framework to facilitate multiple grounding supervisions via sequence generation.
Our proposed DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance.
Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency.
- Score: 13.96848991623376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, neural module networks (NMNs) have yielded ongoing success in
answering compositional visual questions, especially those involving multi-hop
visual and logical reasoning. NMNs decompose the complex question into several
sub-tasks using instance-modules from the reasoning paths of that question and
then exploit intermediate supervisions to guide answer prediction, thereby
improving inference interpretability. However, their performance may be
hindered due to sketchy modeling of intermediate supervisions. For instance,
(1) a prior assumption that each instance-module refers to only one grounded
object yet overlooks other potentially associated grounded objects, impeding
full cross-modal alignment learning; (2) IoU-based intermediate supervisions
may introduce noise signals as the bounding box overlap issue might guide the
model's focus towards irrelevant objects. To address these issues, a novel
method, Detection-based Intermediate
Supervision (DIS), is proposed, which adopts a generative
detection framework to facilitate multiple grounding supervisions via sequence
generation. As such, DIS offers more comprehensive and accurate intermediate
supervisions, thereby boosting answer prediction performance. Furthermore, by
considering intermediate results, DIS enhances the consistency in answering
compositional questions and their sub-questions. Extensive experiments
demonstrate the superiority of our proposed DIS, showcasing both improved
accuracy and state-of-the-art reasoning consistency compared to prior
approaches.
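The bounding box overlap issue mentioned in the abstract can be illustrated with a minimal sketch. It assumes that IoU-based supervision marks a candidate region as relevant when its IoU with the grounded object exceeds a threshold; the boxes, object labels, and the 0.5 threshold are hypothetical and not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: the grounded object and a different, overlapping object.
grounded = (20, 20, 100, 100)   # e.g. the object a module should attend to
unrelated = (20, 30, 110, 110)  # e.g. a neighbouring object with a similar box

THRESHOLD = 0.5  # assumed cut-off for treating a region as a positive

score = iou(grounded, unrelated)
# The unrelated object passes the threshold, so an IoU-only signal
# would label it as relevant even though it is a different object.
is_positive = score >= THRESHOLD
```

With these boxes the overlap is large enough that the unrelated object is treated as a positive, which is the kind of noisy signal the abstract attributes to IoU-based intermediate supervision.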
Related papers
- How to Understand "Support"? An Implicit-enhanced Causal Inference
Approach for Weakly-supervised Phrase Grounding [18.97081348819219]
Weakly-supervised Phrase Grounding (WPG) is an emerging task of inferring the fine-grained phrase-region matching.
This paper proposes an Implicit-Enhanced Causal Inference approach to address the challenges of modeling the implicit relations.
arXiv Detail & Related papers (2024-02-29T12:49:48Z)
- Topic-driven Distant Supervision Framework for Macro-level Discourse
Parsing [72.14449502499535]
The task of analyzing the internal rhetorical structure of texts is a challenging problem in natural language processing.
Despite the recent advances in neural models, the lack of large-scale, high-quality corpora for training remains a major obstacle.
Recent studies have attempted to overcome this limitation by using distant supervision.
arXiv Detail & Related papers (2023-05-23T07:13:51Z)
- TOT: Topology-Aware Optimal Transport For Multimodal Hate Detection [18.015012133043093]
We propose TOT: a topology-aware optimal transport framework to decipher the implicit harm in memes.
Specifically, we leverage an optimal transport kernel method to capture complementary information from multiple modalities.
The newly achieved state-of-the-art performance on two publicly available benchmark datasets, together with further visual analysis, demonstrates the superiority of TOT.
arXiv Detail & Related papers (2023-02-27T06:58:19Z)
- Unpaired Referring Expression Grounding via Bidirectional Cross-Modal
Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
- Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance
Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event.
We propose a dual-branch network which takes as input proposals with multi-granularities in both spatial-temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z)
- Paired Examples as Indirect Supervision in Latent Decision Models [109.76417071249945]
We introduce a way to leverage paired examples that provide stronger cues for learning latent decisions.
We apply our method to improve compositional question answering using neural module networks on the DROP dataset.
arXiv Detail & Related papers (2021-04-05T03:58:30Z)
- Latent Compositional Representations Improve Systematic Generalization
in Grounded Question Answering [46.87501300706542]
State-of-the-art models in grounded question answering often do not explicitly perform decomposition.
We propose a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner.
Our model induces latent trees, driven by end-to-end supervision (the answer) only.
arXiv Detail & Related papers (2020-07-01T06:22:51Z)
- Obtaining Faithful Interpretations from Compositional Neural Networks [72.41100663462191]
We evaluate the intermediate outputs of NMNs on NLVR2 and DROP datasets.
We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour.
arXiv Detail & Related papers (2020-05-02T06:50:35Z)