Analyzing Modular Approaches for Visual Question Decomposition
- URL: http://arxiv.org/abs/2311.06411v1
- Date: Fri, 10 Nov 2023 22:14:26 GMT
- Title: Analyzing Modular Approaches for Visual Question Decomposition
- Authors: Apoorv Khandelwal, Ellie Pavlick, Chen Sun
- Abstract summary: Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on vision-language tasks.
This paper focuses on ViperGPT and asks where its additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes vs. its additional symbolic components.
We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT using a more task-agnostic selection of modules, these gains go away.
- Score: 38.73070270272822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modular neural networks without additional training have recently been shown
to surpass end-to-end neural networks on challenging vision-language tasks. The
latest such methods simultaneously introduce LLM-based code generation to build
programs and a number of skill-specific, task-oriented modules to execute them.
In this paper, we focus on ViperGPT and ask where its additional performance
comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model
it subsumes vs. additional symbolic components. To do so, we conduct a
controlled study (comparing end-to-end, modular, and prompting-based methods
across several VQA benchmarks). We find that ViperGPT's reported gains over
BLIP-2 can be attributed to its selection of task-specific modules, and when we
run ViperGPT using a more task-agnostic selection of modules, these gains go
away. Additionally, ViperGPT retains much of its performance if we make
prominent alterations to its selection of modules: e.g. removing or retaining
only BLIP-2. Finally, we compare ViperGPT against a prompting-based
decomposition strategy and find that, on some benchmarks, modular approaches
significantly benefit by representing subtasks with natural language, instead
of code.
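To make the recipe concrete, below is a minimal Python sketch of the pattern described above: an LLM is prompted to write a short program against a small API of vision modules, and that program is then executed to answer the query. The class and method names (ImagePatch, find, vqa), the hard-coded "generated" program, and the helper names are illustrative assumptions rather than the actual ViperGPT interface or prompts; real modules would wrap detection and VQA models such as BLIP-2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImagePatch:
    """Illustrative image region exposing the module calls a generated program may use."""
    image: object  # placeholder for pixel data

    def find(self, object_name: str) -> List["ImagePatch"]:
        # Hypothetical stand-in for an open-vocabulary object detector module.
        return []

    def vqa(self, question: str) -> str:
        # Hypothetical stand-in for an end-to-end VQA module (e.g. a BLIP-2-like model).
        return "unknown"


def generate_program(query: str) -> str:
    # Stand-in for LLM-based code generation: the LLM would be prompted with the
    # query plus the ImagePatch API documentation. One plausible output is hard-coded here.
    return (
        "def execute_command(image):\n"
        "    muffins = image.find('muffin')\n"
        "    return str(len(muffins))\n"
    )


def answer(query: str, image: object) -> str:
    program = generate_program(query)                        # 1) LLM writes a program for this query
    namespace: dict = {}
    exec(program, namespace)                                 # 2) compile the generated program
    return namespace["execute_command"](ImagePatch(image))   # 3) run it against the image


if __name__ == "__main__":
    print(answer("How many muffins are there?", image=None))
```

A prompting-based decomposition, as compared in the paper, would instead have the LLM emit natural-language subquestions and answer each with a VQA module directly, rather than generating executable code.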
Related papers
- Learning to Route for Dynamic Adapter Composition in Continual Learning with Language Models [56.93608812478369]
We present L2R, a method that isolates the training of new PEFT modules to ensure their task specialization.
L2R then learns to compose the learned modules by training a network of routers that leverages a small memory containing examples of previously seen tasks.
Our results demonstrate that L2R provides an effective composition of PEFT modules, leading to improved generalization and performance compared to other methods.
arXiv Detail & Related papers (2024-08-16T23:57:29Z)
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM)
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z)
- Deep Submodular Peripteral Networks [1.8061637661945513]
We introduce deep submodular peripteral networks (DSPNs), a novel family of submodular functions, and methods for their training.
We demonstrate DSPNs' efficacy in learning submodularity from a costly target submodular function and demonstrate their superiority both for experimental design and online streaming applications.
arXiv Detail & Related papers (2024-03-13T02:53:52Z)
- GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs [64.49176353858792]
We propose generative neuro-symbolic visual reasoning by growing and reusing modules.
The proposed model performs competitively on standard tasks like visual question answering and referring expression comprehension.
It is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
arXiv Detail & Related papers (2023-11-08T18:59:05Z)
- ViperGPT: Visual Inference via Python Execution for Reasoning [23.56704214763551]
We introduce ViperGPT, a framework that composes vision-and-language models into subroutines to produce a result for any query.
This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
arXiv Detail & Related papers (2023-03-14T17:57:47Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)