Interpretable Neural Computation for Real-World Compositional Visual
Question Answering
- URL: http://arxiv.org/abs/2010.04913v1
- Date: Sat, 10 Oct 2020 05:46:22 GMT
- Title: Interpretable Neural Computation for Real-World Compositional Visual
Question Answering
- Authors: Ruixue Tang, Chao Ma
- Abstract summary: We build an interpretable framework for real-world compositional VQA.
In our framework, images and questions are disentangled into scene graphs and programs, and a symbolic program executor runs on them with full transparency to select the attention regions.
Experiments conducted on the GQA benchmark demonstrate that our framework outperforms the compositional prior arts and achieves competitive accuracy among monolithic ones.
- Score: 4.3668650778541895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are two main lines of research on visual question answering (VQA):
compositional model with explicit multi-hop reasoning, and monolithic network
with implicit reasoning in the latent feature space. The former excels in
interpretability and compositionality but fails on real-world images, while the
latter usually achieves better performance due to model flexibility and
parameter efficiency. We aim to combine the two to build an interpretable
framework for real-world compositional VQA. In our framework, images and
questions are disentangled into scene graphs and programs, and a symbolic
program executor runs on them with full transparency to select the attention
regions, which are then iteratively passed to a visual-linguistic pre-trained
encoder to predict answers. Experiments conducted on the GQA benchmark
demonstrate that our framework outperforms the compositional prior arts and
achieves competitive accuracy among monolithic ones. With respect to the
validity, plausibility and distribution metrics, our framework surpasses others
by a considerable margin.
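As a rough sketch of this loop, the following Python illustrates the transparent executor; the scene-graph schema, program ops, and encoder interface are assumptions made for illustration, not the authors' released code.

```python
# Minimal sketch of the proposed neuro-symbolic loop. The scene-graph
# schema, the program ops, and the encoder interface are assumptions.

def execute_program(scene_graph, program):
    """Run a symbolic program over a scene graph with full transparency;
    every step either filters objects or follows relations."""
    attended = set(scene_graph["objects"])             # start from all objects
    for op, arg in program:                            # e.g. ("filter", "red")
        if op == "filter":
            attended = {o for o in attended
                        if arg in scene_graph["attributes"][o]}
        elif op == "relate":                           # e.g. ("relate", "on")
            attended = {dst for (src, rel, dst) in scene_graph["relations"]
                        if src in attended and rel == arg}
    return attended                                    # selected attention regions

def answer(image, question, scene_parser, program_generator, vl_encoder):
    scene_graph = scene_parser(image)                  # image -> scene graph
    program = program_generator(question)              # question -> program
    regions = execute_program(scene_graph, program)    # transparent selection
    # The selected regions are passed to a visual-linguistic pre-trained
    # encoder that predicts the final answer.
    return vl_encoder(image, question, regions)
```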
Related papers
- ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models [39.520825264698374]
Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs.
We present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model.
Our results underscore the good potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
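A minimal sketch of what such cross-architecture feature distillation can look like, assuming the student exposes token features of the teacher's shape (the model interfaces and loss weight are illustrative, not ViT-Linearizer's actual training code):

```python
import torch
import torch.nn.functional as F

# Hedged sketch: a linear-time, recurrent-style student mimics a frozen
# ViT teacher's token representations. Model classes are assumptions.

def distill_step(teacher_vit, student_rnn, images, optimizer, alpha=1.0):
    with torch.no_grad():
        t_feats = teacher_vit(images)         # (B, N, D) teacher token features
    s_feats = student_rnn(images)             # (B, N, D), linear-time scan
    loss = alpha * F.mse_loss(s_feats, t_feats)   # feature-matching objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```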
arXiv Detail & Related papers (2025-03-30T15:35:24Z) - Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.
Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z) - Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models [26.525531111141717]
Vision-Language Models learn a shared feature space for text and images, enabling the comparison of inputs of different modalities.
We investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data.
We propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space.
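As a toy illustration of geometry-aware composition (an assumed simplification, not GDE's actual decomposition), embeddings can be composed in the tangent space of the unit sphere rather than by naive vector addition:

```python
import torch
import torch.nn.functional as F

# Toy illustration only: compose two unit embeddings via log/exp maps
# at a base point on the sphere. The paper's decomposition is more involved.

def log_map(base, x, eps=1e-8):             # sphere -> tangent space at base
    cos = (x * base).sum(-1, keepdim=True).clamp(-1.0, 1.0)
    proj = x - cos * base                   # component orthogonal to base
    return proj / (proj.norm(dim=-1, keepdim=True) + eps) * torch.acos(cos)

def exp_map(base, v, eps=1e-8):             # tangent space -> sphere
    n = v.norm(dim=-1, keepdim=True)
    return torch.cos(n) * base + torch.sin(n) * v / (n + eps)

def compose(base, attr_emb, obj_emb):
    """Compose, e.g., a 'red' and a 'car' embedding geodesically."""
    v = (log_map(base, F.normalize(attr_emb, dim=-1))
         + log_map(base, F.normalize(obj_emb, dim=-1)))
    return exp_map(base, v)                 # composed point on the sphere
```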
arXiv Detail & Related papers (2025-03-21T13:46:53Z) - ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning [62.61187785810336]
ImageScope is a training-free, three-stage framework that unifies language-guided image retrieval tasks.
In the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity.
In the second and third stages, we reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally.
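A hedged skeleton of such a three-stage pipeline, assuming a hypothetical LMM wrapper with generate/score/verify/compare methods (not ImageScope's actual API):

```python
# Skeleton of a training-free, three-stage language-guided retrieval
# pipeline. The `lmm` interface and prompts are assumptions.

def retrieve(query, gallery, lmm, top_k=10):
    # Stage 1: synthesize search intents at several semantic granularities
    # to make the query robust.
    intents = [lmm.generate(f"Rewrite at {g} granularity: {query}")
               for g in ("coarse", "medium", "fine")]
    candidates = sorted(gallery,
                        key=lambda im: max(lmm.score(im, i) for i in intents),
                        reverse=True)[:top_k]
    # Stage 2: local verification -- keep candidates whose content
    # satisfies the query's predicate propositions.
    verified = [im for im in candidates if lmm.verify(im, query)] or candidates
    # Stage 3: global pairwise evaluation to select the final result.
    best = verified[0]
    for im in verified[1:]:
        best = im if lmm.compare(query, im, best) else best
    return best
```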
arXiv Detail & Related papers (2025-03-13T08:43:24Z) - Teaching LMMs for Image Quality Scoring and Interpreting [71.1335005098584]
We propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables image quality scoring and interpreting simultaneously.
Q-SiT is the first model capable of performing image quality scoring and interpreting simultaneously; a lightweight variant, Q-SiT-mini, is also introduced.
Experimental results demonstrate that Q-SiT achieves strong performance on both tasks, with superior IQA generalization ability.
arXiv Detail & Related papers (2025-03-12T09:39:33Z) - LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification [5.8754760054410955]
We introduce Hi-CoDecomposition, a novel framework designed to enhance model interpretability through structured concept analysis.
Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process.
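One way such a decomposition could look in code, with hypothetical llm and clip_score interfaces (prompts and tree structure are illustrative, not the paper's implementation):

```python
# Hypothetical sketch of LLM-driven concept decomposition for
# interpretable fine-grained classification. Assumes `llm` returns a
# list of strings and `clip_score` returns an image-text similarity.

def build_concept_tree(class_name, llm):
    parts = llm(f"List the main visible parts of a {class_name}.")
    return {class_name: {p: llm(f"List visual attributes of the {p} "
                                f"of a {class_name}.") for p in parts}}

def classify(image, class_names, llm, clip_score):
    def tree_score(tree):
        # Average similarity over leaf concepts, so each prediction is
        # backed by human-readable evidence.
        leaves = [a for parts in tree.values()
                  for attrs in parts.values() for a in attrs]
        return sum(clip_score(image, a) for a in leaves) / len(leaves)
    trees = {c: build_concept_tree(c, llm) for c in class_names}
    return max(class_names, key=lambda c: tree_score(trees[c]))
```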
arXiv Detail & Related papers (2024-05-29T00:36:56Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language
Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
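A schematic version of the loop, with assumed vision-model and LLM interfaces (including a structured draft with .answer/.facts fields):

```python
# Schematic see-think-confirm loop; all interfaces are assumptions.

def ipvr(image, question, vision_model, llm, max_rounds=3):
    evidence = vision_model.describe(image)            # "see"
    for _ in range(max_rounds):
        draft = llm(f"Evidence: {evidence}\nQuestion: {question}\n"
                    "Answer and list the visual facts you relied on.")  # "think"
        # "confirm": every supporting fact must be visually verifiable.
        failed = [f for f in draft.facts if not vision_model.verify(image, f)]
        if not failed:
            return draft.answer
        evidence += f"\nNot supported by the image: {failed}"  # re-prompt
    return draft.answer
```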
arXiv Detail & Related papers (2023-01-12T18:59:50Z) - SA-VQA: Structured Alignment of Visual and Semantic Representations for
Visual Question Answering [29.96818189046649]
We propose to apply structured alignments, which work with graph representations of visual and textual content.
As demonstrated in our experimental results, such a structured alignment improves reasoning performance.
The proposed model, without any pretraining, outperforms the state-of-the-art methods on GQA dataset, and beats the non-pretrained state-of-the-art methods on VQA-v2 dataset.
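A simplified view of what structured alignment over two graphs can look like; this soft-matching objective is an illustration, not the paper's exact formulation:

```python
import torch

# Toy structured alignment: soft-match visual and textual graph nodes,
# and score edge agreement through the matching. Shapes are assumptions.

def structured_alignment(v_nodes, t_nodes, v_adj, t_adj):
    """v_nodes: (Nv, D), t_nodes: (Nt, D); *_adj: adjacency matrices."""
    sim = v_nodes @ t_nodes.T                        # node similarities
    match = torch.softmax(sim, dim=1)                # soft match V -> T
    node_score = (match * sim).sum()
    # Edge agreement: map visual edges through the soft matching and
    # compare with textual edges.
    mapped_adj = match.T @ v_adj.float() @ match     # (Nt, Nt)
    edge_score = -(mapped_adj - t_adj.float()).pow(2).mean()
    return node_score + edge_score                   # higher = better aligned
```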
arXiv Detail & Related papers (2022-01-25T22:26:09Z) - How to Design Sample and Computationally Efficient VQA Models [53.65668097847456]
We find that representing text as probabilistic programs and images as object-level scene graphs best satisfies the desiderata of sample and computational efficiency.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
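A toy sketch of soft program execution over an object-level scene graph, where each step re-weights a differentiable attention distribution over objects (tensor shapes and the module bank are assumptions):

```python
import torch

# Instead of hard symbolic steps, each step mixes the outputs of a bank
# of candidate modules, keeping the whole pipeline differentiable.

def soft_execute(obj_feats, module_probs, module_bank):
    """obj_feats: (N, D) object embeddings; module_probs: (T, M)
    per-step distribution over M candidate modules (from the question).
    Each module maps (obj_feats, attn) -> new (N,) attention."""
    n = obj_feats.size(0)
    attn = torch.full((n,), 1.0 / n)                 # uniform start
    for step_probs in module_probs:                  # T soft program steps
        outs = torch.stack([m(obj_feats, attn) for m in module_bank])
        attn = (step_probs[:, None] * outs).sum(dim=0)   # expectation over modules
    return attn                                      # final object attention
```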
arXiv Detail & Related papers (2021-03-22T01:48:16Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
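Schematically, one read-update-control step over multiple graphs might look as follows; the attention and gating parametrization is an assumption, not the paper's exact design:

```python
import torch

# One memory-based reasoning step over visual / semantic / factual graphs.

def gruc_step(memory, graphs, W_read, W_update, controller):
    """memory: (D,); graphs: list of (N_i, D) node-feature tensors;
    W_read: (D, D); W_update: (D, 2D); controller: memory -> stop signal."""
    reads = []
    for nodes in graphs:
        # Read: attend over each graph's nodes, conditioned on memory.
        scores = torch.softmax(nodes @ (W_read @ memory), dim=0)
        reads.append((scores[:, None] * nodes).sum(dim=0))
    # Update: fuse the per-graph readouts into the memory state.
    fused = torch.stack(reads).mean(dim=0)
    memory = torch.tanh(W_update @ torch.cat([memory, fused]))
    # Control: decide whether another reasoning step is needed.
    return memory, controller(memory)
```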
arXiv Detail & Related papers (2020-08-31T23:25:01Z) - REXUP: I REason, I EXtract, I UPdate with Structured Compositional
Reasoning for Visual Question Answering [4.02726934790798]
We propose a deep reasoning VQA model with explicit visual structure-aware textual information.
The REXUP network consists of two branches, one image-object-oriented and one scene-graph-oriented, which work jointly with a super-diagonal fusion compositional attention network.
Our best model significantly outperforms the previous state-of-the-art, delivering 92.7% on the validation set and 73.1% on the test-dev set.
arXiv Detail & Related papers (2020-07-27T00:54:50Z) - A Flexible Framework for Designing Trainable Priors with Adaptive
Smoothing and Game Encoding [57.1077544780653]
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems.
We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions.
This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end.
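A minimal instance of this idea (not the paper's game formulation): a layer whose forward pass unrolls proximal gradient descent on a non-smooth convex sparse-coding problem, with the dictionary trained end to end:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Forward pass = unrolled ISTA on min_z 0.5*||x - Dz||^2 + lam*||z||_1,
# so the dictionary D acts as a trainable prior. Sizes are illustrative.

class UnrolledISTA(nn.Module):
    def __init__(self, dim, code_dim, steps=10, lam=0.1):
        super().__init__()
        self.D = nn.Parameter(torch.randn(dim, code_dim) * 0.01)
        self.steps, self.lam = steps, lam

    def forward(self, x):                        # x: (B, dim)
        L = (self.D.T @ self.D).norm() + 1e-6    # crude Lipschitz bound
        z = x.new_zeros(x.size(0), self.D.size(1))
        for _ in range(self.steps):              # unrolled proximal steps
            grad = (z @ self.D.T - x) @ self.D / L
            z = F.softshrink(z - grad, (self.lam / L).item())
        return z                                 # sparse code, differentiable
```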
arXiv Detail & Related papers (2020-06-26T08:34:54Z) - Linguistically Driven Graph Capsule Network for Visual Question
Reasoning [153.76012414126643]
We propose a hierarchical compositional reasoning model called the "Linguistically driven Graph Capsule Network".
The compositional process is guided by the linguistic parse tree. Specifically, we bind each capsule in the lowest layer to bridge the linguistic embedding of a single word in the original question with visual evidence.
Experiments on the CLEVR dataset, CLEVR compositional generation test, and FigureQA dataset demonstrate the effectiveness and composition generalization ability of our end-to-end model.
arXiv Detail & Related papers (2020-03-23T03:34:25Z) - Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images.
Existing SGG methods require millions of manually annotated bounding boxes for training.
We propose VSPNet, a visual semantic parsing network, together with a graph-based weakly supervised learning framework.
arXiv Detail & Related papers (2020-01-08T03:46:13Z) - Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
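A compact sketch of the architecture described, with illustrative channel sizes and decoder depth:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A lightweight pre-trained classification backbone yields convolutional
# features, which are fused with a global scene descriptor before a small
# decoder predicts the saliency map. Layer sizes are assumptions.

class SaliencyNet(nn.Module):
    def __init__(self, backbone, feat_ch=512):
        super().__init__()
        self.backbone = backbone                     # pre-trained CNN encoder
        self.pool = nn.AdaptiveAvgPool2d(1)          # global scene context
        self.decode = nn.Sequential(
            nn.Conv2d(feat_ch * 2, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 1))                    # 1-channel saliency logits

    def forward(self, x):
        f = self.backbone(x)                         # (B, C, H', W') features
        g = self.pool(f).expand_as(f)                # broadcast global context
        s = self.decode(torch.cat([f, g], dim=1))    # fuse local + global
        return torch.sigmoid(F.interpolate(
            s, size=x.shape[-2:], mode="bilinear", align_corners=False))
```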
arXiv Detail & Related papers (2019-02-18T16:15:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.