Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for
Complex Visual Reasoning Tasks
- URL: http://arxiv.org/abs/2307.16395v1
- Date: Mon, 31 Jul 2023 03:57:31 GMT
- Title: Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for
Complex Visual Reasoning Tasks
- Authors: Kousik Rajesh, Mrigank Raman, Mohammed Asad Karim, Pranit Chawla
- Abstract summary: Bridge-architectures project from the image space to the text space to solve tasks such as VQA, captioning, and image retrieval.
We extend the traditional bridge architectures for the NLVR2 dataset, by adding object level features to faciliate fine-grained object reasoning.
Our analysis shows that adding object level features to bridge architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2.
- Score: 4.093474663507322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent times there has been a surge of multi-modal architectures based on
Large Language Models, which leverage the zero shot generation capabilities of
LLMs and project image embeddings into the text space and then use the
auto-regressive capacity to solve tasks such as VQA, captioning, and image
retrieval. We name these architectures as "bridge-architectures" as they
project from the image space to the text space. These models deviate from the
traditional recipe of training transformer based multi-modal models, which
involve using large-scale pre-training and complex multi-modal interactions
through co or cross attention. However, the capabilities of bridge
architectures have not been tested on complex visual reasoning tasks which
require fine grained analysis about the image. In this project, we investigate
the performance of these bridge-architectures on the NLVR2 dataset, and compare
it to state-of-the-art transformer based architectures. We first extend the
traditional bridge architectures for the NLVR2 dataset, by adding object level
features to faciliate fine-grained object reasoning. Our analysis shows that
adding object level features to bridge architectures does not help, and that
pre-training on multi-modal data is key for good performance on complex
reasoning tasks such as NLVR2. We also demonstrate some initial results on a
recently bridge-architecture, LLaVA, in the zero shot setting and analyze its
performance.
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z) - Mixed-Query Transformer: A Unified Image Segmentation Architecture [57.32212654642384]
Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task.
We introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights.
arXiv Detail & Related papers (2024-04-06T01:54:17Z) - Building Optimal Neural Architectures using Interpretable Knowledge [15.66288233048004]
AutoBuild is a scheme which learns to align the latent embeddings of operations and architecture modules with the ground-truth performance of the architectures they appear in.
We show that by mining a relatively small set of evaluated architectures, AutoBuild can learn to build high-quality architectures directly or help to reduce search space to focus on relevant areas.
arXiv Detail & Related papers (2024-03-20T04:18:38Z) - RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model [22.56227565913003]
We propose a comprehensive remote sensing image building model, termed RSBuilding, developed from the perspective of the foundation model.
RSBuilding is designed to enhance cross-scene generalization and task understanding.
Our model was trained on a dataset comprising up to 245,000 images and validated on multiple building extraction and change detection datasets.
arXiv Detail & Related papers (2024-03-12T11:51:59Z) - The Impact of Different Backbone Architecture on Autonomous Vehicle
Dataset [120.08736654413637]
The quality of the features extracted by the backbone architecture can have a significant impact on the overall detection performance.
Our study evaluates three well-known autonomous vehicle datasets, namely KITTI, NuScenes, and BDD, to compare the performance of different backbone architectures on object detection tasks.
arXiv Detail & Related papers (2023-09-15T17:32:15Z) - BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning [91.93547262073213]
Vision-Language (VL) models with the Two-Tower architecture have dominated visual representation learning in recent years.
We propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder.
BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data.
arXiv Detail & Related papers (2022-06-17T09:42:35Z) - Rethinking Multi-Modal Alignment in Video Question Answering from
Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual feature with language feature.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z) - Stage-Wise Neural Architecture Search [65.03109178056937]
Modern convolutional networks such as ResNet and NASNet have achieved state-of-the-art results in many computer vision applications.
These networks consist of stages, which are sets of layers that operate on representations in the same resolution.
It has been demonstrated that increasing the number of layers in each stage improves the prediction ability of the network.
However, the resulting architecture becomes computationally expensive in terms of floating point operations, memory requirements and inference time.
arXiv Detail & Related papers (2020-04-23T14:16:39Z) - Hierarchical Neural Architecture Search for Single Image
Super-Resolution [18.624661846174412]
Deep neural networks have exhibited promising performance in image super-resolution (SR)
Most SR models follow a hierarchical architecture that contains both the cell-level design of computational blocks and the network-level design of the positions of upsampling blocks.
We propose a Hierarchical Neural Architecture Search (HNAS) method to automatically design promising architectures with different requirements of computation cost.
arXiv Detail & Related papers (2020-03-10T10:19:44Z) - Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN)
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.