How to Design Sample and Computationally Efficient VQA Models
- URL: http://arxiv.org/abs/2103.11537v1
- Date: Mon, 22 Mar 2021 01:48:16 GMT
- Title: How to Design Sample and Computationally Efficient VQA Models
- Authors: Karan Samel, Zelin Zhao, Binghong Chen, Kuan Wang, Robin Luo, Le Song
- Abstract summary: We find that representing the text as probabilistic programs and images as object-level scene graphs best satisfy these desiderata.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
- Score: 53.65668097847456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In multi-modal reasoning tasks, such as visual question answering (VQA),
there have been many modeling and training paradigms tested. Previous models
propose different methods for the vision and language tasks, but which ones
perform the best while being sample and computationally efficient? Based on our
experiments, we find that representing the text as probabilistic programs and
images as object-level scene graphs best satisfy these desiderata. We extend
existing models to leverage these soft programs and scene graphs to train on
question answer pairs in an end-to-end manner. Empirical results demonstrate
that this differentiable end-to-end program executor is able to maintain
state-of-the-art accuracy while being sample and computationally efficient.
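The core idea of a differentiable program executor over a scene graph can be sketched as follows. This is an illustrative toy, not the authors' implementation: the object attribute scores and the two "filter" operators are hypothetical, and each program step mixes operator outputs by a soft operator distribution so that the whole execution stays differentiable.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical toy scene graph: 3 objects with attribute scores in [0, 1]
# (columns: "red", "square"). In the paper's setting these would come from
# a learned object detector; here they are hand-set for illustration.
objects = np.array([
    [0.9, 0.1],   # object 0: red circle
    [0.8, 0.95],  # object 1: red square
    [0.1, 0.9],   # object 2: blue square
])

def soft_filter(attention, attr_scores):
    """One differentiable program step: reweight the attention over
    objects by how strongly each object exhibits the attribute."""
    new = attention * attr_scores
    return new / (new.sum() + 1e-8)

def execute(program_logits, attention):
    """Execute a 'soft program': at each step a distribution over the
    candidate operators (filter-red vs. filter-square) mixes the results
    of all operators, keeping execution end-to-end differentiable."""
    for logits in program_logits:
        op_probs = softmax(logits)  # soft choice of operator
        branches = [soft_filter(attention, objects[:, a]) for a in range(2)]
        attention = sum(p * b for p, b in zip(op_probs, branches))
    return attention

# Question "find the red square": two steps, each nearly certain of its operator.
program = [np.array([5.0, -5.0]),   # step 1: ~filter red
           np.array([-5.0, 5.0])]   # step 2: ~filter square
att = execute(program, np.ones(3) / 3)
print(att.argmax())  # object 1 (the red square) receives the most attention
```

In training, the program logits would be predicted from the question text and updated by backpropagating the answer loss through `execute`.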
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
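The selection criterion can be sketched in a few lines. The `question_loglik` scorer below is a hypothetical stand-in for an actual language model's summed token log-probabilities; only the selection rule (pick the prompt under which the question is most likely) reflects the paper's idea.

```python
import math

def question_loglik(prompt, question):
    """Hypothetical stand-in for an LM's log-likelihood of the question
    given a candidate prompt. Toy heuristic: more shared words are
    assumed to make the question more 'likely'. Purely illustrative."""
    overlap = len(set(prompt.lower().split()) & set(question.lower().split()))
    return math.log(1 + overlap)

def select_prompt(prompts, question):
    """Pick the candidate prompt under which the question is most
    likely, the gauge-based selection rule the paper proposes."""
    return max(prompts, key=lambda p: question_loglik(p, question))

prompts = [
    "Answer the trivia question below.",
    "You are a geography expert. Answer the question about capital cities.",
]
question = "What is the capital of France?"
best = select_prompt(prompts, question)
print(best)
```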
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment [23.756311527978486]
The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment.
We develop a fine-tuning pipeline to boost the performance of models.
We showcase that a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models.
arXiv Detail & Related papers (2024-06-17T08:48:02Z)
- Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models [48.77653835765705]
We introduce a probabilistic resolution to prompt tuning, where the label-specific prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model.
We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts.
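The hierarchical generation step can be sketched as follows, assuming made-up dimensions and a single linear map as the "lightweight generative model" (the paper's actual architecture and learned parameters are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): latent size 4,
# a prompt of 2 tokens in an 8-dim embedding space.
LATENT, TOKENS, EMB = 4, 2, 8

# Lightweight generative model: one linear map from latent vector to
# prompt token embeddings. Randomly initialized here for illustration;
# in practice it would be learned.
W = rng.normal(size=(LATENT, TOKENS * EMB)) * 0.1

def sample_prompt(label_mu, label_logvar):
    """Hierarchical prompt generation: sample a latent vector from the
    label-specific Gaussian (reparameterization trick), then decode it
    into prompt token embeddings."""
    eps = rng.normal(size=LATENT)
    z = label_mu + np.exp(0.5 * label_logvar) * eps  # z ~ N(mu, sigma^2)
    return (z @ W).reshape(TOKENS, EMB)

# One label's variational parameters (random here for illustration).
mu, logvar = rng.normal(size=LATENT), np.full(LATENT, -2.0)
prompt = sample_prompt(mu, logvar)
print(prompt.shape)  # (2, 8): two prompt tokens, ready to prepend to the input
```

Because the prompt is sampled rather than fixed, repeated calls yield different prompts for the same label, which is what makes the treatment probabilistic.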
arXiv Detail & Related papers (2023-03-16T06:09:15Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
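The prequential flavor of the MDL metric can be sketched as follows. This toy uses two hand-rolled binary "readout" predictors and a fixed Bayesian mixture; the paper's method additionally switches between model classes over time, which is not reproduced here.

```python
import math

class LaplacePredictor:
    """Online Bernoulli readout with Laplace (add-one) smoothing."""
    def __init__(self):
        self.counts = [1, 1]
    def prob(self, y):
        return self.counts[y] / sum(self.counts)
    def update(self, y):
        self.counts[y] += 1

class UniformPredictor:
    """Trivial readout: always predicts 0.5 (1 bit per label)."""
    def prob(self, y):
        return 0.5
    def update(self, y):
        pass

def prequential_codelength(labels, predictors):
    """Prequential MDL codelength in bits: each label is encoded with a
    Bayesian mixture of readouts, each trained only on the labels seen
    so far. A shorter codelength means the labels were easier to
    predict, i.e. the representation is better."""
    w = [1.0 / len(predictors)] * len(predictors)  # uniform prior
    bits = 0.0
    for y in labels:
        probs = [p.prob(y) for p in predictors]
        mix = sum(wi * pi for wi, pi in zip(w, probs))
        bits += -math.log2(mix)
        # Bayesian posterior update of the mixture weights
        w = [wi * pi / mix for wi, pi in zip(w, probs)]
        for p in predictors:
            p.update(y)
    return bits

skewed = [1] * 18 + [0] * 2  # highly predictable label sequence
bits = prequential_codelength(skewed, [LaplacePredictor(), UniformPredictor()])
print(round(bits, 2))  # well under the 20 bits a uniform coder would need
```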
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Partition Function Estimation: A Quantitative Study [25.782420501870295]
A graphical model's partition function is a central quantity of interest.
Several techniques have been proposed over the years with varying guarantees on the quality of estimates.
Our empirical study yields a surprising observation: exact techniques are as efficient as the approximate ones.
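For intuition about what is being estimated, here is an exact partition function computed by exhaustive enumeration on a made-up 3-variable pairwise binary model. Enumeration is only feasible for tiny models, but it is the kind of exact ground truth such a study benchmarks the estimators against.

```python
import itertools
import math

# Toy pairwise binary model on 3 spin variables (a triangle).
# Z = sum over all assignments of exp(sum of pairwise couplings).
edges = {(0, 1): 0.5, (1, 2): -0.3, (0, 2): 0.8}

def partition_function(n_vars, edges):
    """Exact partition function by exhaustive enumeration over all
    2^n_vars assignments of the spin variables."""
    Z = 0.0
    for assignment in itertools.product([-1, 1], repeat=n_vars):
        energy = sum(w * assignment[i] * assignment[j]
                     for (i, j), w in edges.items())
        Z += math.exp(energy)
    return Z

Z = partition_function(3, edges)
print(round(math.log(Z), 3))  # log-partition function of the toy model
```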
arXiv Detail & Related papers (2021-05-24T07:25:43Z)
- Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
- What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z)
- Can We Learn Heuristics For Graphical Model Inference Using Reinforcement Learning? [114.24881214319048]
We show that we can learn programs, i.e., policies, for solving inference in higher order Conditional Random Fields (CRFs) using reinforcement learning.
Our method solves inference tasks efficiently without imposing any constraints on the form of the potentials.
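The idea of learning an inference policy can be sketched with tabular Q-learning on a made-up chain CRF (the paper targets higher-order CRFs with learned policies; a 4-variable chain and plain Q-learning keep the sketch small and self-contained). An episode assigns variables left to right, the return is the assignment's total log-potential, and the greedy policy therefore performs MAP inference.

```python
import random

random.seed(0)

# Toy chain CRF over 4 binary variables: log-potentials to maximize.
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.5], [1.5, 0.0]]
pair = [[0.8, -0.8], [-0.8, 0.8]]   # agreement between neighbors is rewarded

def step_reward(pos, prev, a):
    """Log-potential gained by assigning value a at position pos."""
    return unary[pos][a] + (pair[prev][a] if pos > 0 else 0.0)

# Tabular Q-learning: state = (position, previous value),
# action = value assigned to the current variable.
Q = {}
def q(s, a):
    return Q.get((s, a), 0.0)

for _ in range(2000):
    prev = 0                        # dummy value before the first variable
    for pos in range(4):
        s = (pos, prev)
        if random.random() < 0.2:   # epsilon-greedy exploration
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda v: q(s, v))
        r = step_reward(pos, prev, a)
        future = max(q((pos + 1, a), 0), q((pos + 1, a), 1)) if pos < 3 else 0.0
        Q[(s, a)] = q(s, a) + 0.1 * (r + future - q(s, a))
        prev = a

# Greedy rollout: the learned inference procedure.
prev, assignment = 0, []
for pos in range(4):
    a = max((0, 1), key=lambda v: q((pos, prev), v))
    assignment.append(a)
    prev = a
print(assignment)
```

Because the environment is deterministic and tiny, the greedy rollout recovers the MAP assignment; the appeal of the approach is that the same recipe needs no constraints on the form of the potentials.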
arXiv Detail & Related papers (2020-04-27T19:24:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.