Analyzing Modular Approaches for Visual Question Decomposition
- URL: http://arxiv.org/abs/2311.06411v1
- Date: Fri, 10 Nov 2023 22:14:26 GMT
- Title: Analyzing Modular Approaches for Visual Question Decomposition
- Authors: Apoorv Khandelwal, Ellie Pavlick, Chen Sun
- Abstract summary: Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on vision-language tasks.
This paper focuses on ViperGPT and asks where its additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes vs. its additional symbolic components.
We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT using a more task-agnostic selection of modules, these gains go away.
- Score: 38.73070270272822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modular neural networks without additional training have recently been shown
to surpass end-to-end neural networks on challenging vision-language tasks. The
latest such methods simultaneously introduce LLM-based code generation to build
programs and a number of skill-specific, task-oriented modules to execute them.
In this paper, we focus on ViperGPT and ask where its additional performance
comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model
it subsumes vs. additional symbolic components. To do so, we conduct a
controlled study (comparing end-to-end, modular, and prompting-based methods
across several VQA benchmarks). We find that ViperGPT's reported gains over
BLIP-2 can be attributed to its selection of task-specific modules, and when we
run ViperGPT using a more task-agnostic selection of modules, these gains go
away. Additionally, ViperGPT retains much of its performance if we make
prominent alterations to its selection of modules: e.g. removing or retaining
only BLIP-2. Finally, we compare ViperGPT against a prompting-based
decomposition strategy and find that, on some benchmarks, modular approaches
significantly benefit by representing subtasks with natural language, instead
of code.
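To make the recipe concrete, below is a minimal Python sketch of the pattern described above: an LLM is prompted to write a short program against a small API of vision modules, and that program is then executed to answer the query. The class and method names (ImagePatch, find, vqa), the hard-coded "generated" program, and the helper names are illustrative assumptions rather than the actual ViperGPT interface or prompts; real modules would wrap detection and VQA models such as BLIP-2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImagePatch:
    """Illustrative image region exposing the module calls a generated program may use."""
    image: object  # placeholder for pixel data

    def find(self, object_name: str) -> List["ImagePatch"]:
        # Hypothetical stand-in for an open-vocabulary object detector module.
        return []

    def vqa(self, question: str) -> str:
        # Hypothetical stand-in for an end-to-end VQA module (e.g. a BLIP-2-like model).
        return "unknown"


def generate_program(query: str) -> str:
    # Stand-in for LLM-based code generation: the LLM would be prompted with the
    # query plus the ImagePatch API documentation. One plausible output is hard-coded here.
    return (
        "def execute_command(image):\n"
        "    muffins = image.find('muffin')\n"
        "    return str(len(muffins))\n"
    )


def answer(query: str, image: object) -> str:
    program = generate_program(query)                        # 1) LLM writes a program for this query
    namespace: dict = {}
    exec(program, namespace)                                 # 2) compile the generated program
    return namespace["execute_command"](ImagePatch(image))   # 3) run it against the image


if __name__ == "__main__":
    print(answer("How many muffins are there?", image=None))
```

A prompting-based decomposition, as compared in the paper, would instead have the LLM emit natural-language subquestions and answer each with a VQA module directly, rather than generating executable code.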
Related papers
- Learning to Route for Dynamic Adapter Composition in Continual Learning with Language Models [56.93608812478369]
We present L2R, a method that isolates the training of new PEFT modules to ensure their task specialization.
L2R then learns to compose the learned modules by training a network of routers that leverages a small memory containing examples of previously seen tasks.
Our results demonstrate that L2R provides an effective composition of PEFT modules, leading to improved generalization and performance compared to other methods.
arXiv Detail & Related papers (2024-08-16T23:57:29Z)
- Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM)
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z)
- Deep Submodular Peripteral Networks [1.8061637661945513]
We introduce deep submodular peripteral networks (DSPNs), a novel family of submodular functions, and methods for their training.
We demonstrate DSPNs' efficacy in learning submodularity from a costly target submodular function and demonstrate their superiority both for experimental design and online streaming applications.
arXiv Detail & Related papers (2024-03-13T02:53:52Z)
- GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs [64.49176353858792]
We propose generative neuro-symbolic visual reasoning by growing and reusing modules.
The proposed model performs competitively on standard tasks like visual question answering and referring expression comprehension.
It is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
arXiv Detail & Related papers (2023-11-08T18:59:05Z)
- ViperGPT: Visual Inference via Python Execution for Reasoning [23.56704214763551]
We introduce ViperGPT, a framework that composes vision-and-language models into subroutines to produce a result for any query.
This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
arXiv Detail & Related papers (2023-03-14T17:57:47Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)