LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering
- URL: http://arxiv.org/abs/2401.15842v2
- Date: Sat, 23 Mar 2024 02:46:54 GMT
- Title: LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering
- Authors: Yuhan Chen, Lumei Su, Lihua Chen, Zhiwei Lin
- Abstract summary: The LCV2 modular method is proposed for the Grounded Visual Question Answering task in the vision-language multimodal domain.
This approach relies on a frozen large language model (LLM) as an intermediary between an off-the-shelf VQA model and an off-the-shelf visual grounding (VG) model.
This framework can be deployed for VQA Grounding tasks under low computational resources.
- Score: 6.263815658578159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, the LCV2 modular method is proposed for the Grounded Visual Question Answering task in the vision-language multimodal domain. This approach relies on a frozen large language model (LLM) as an intermediary between an off-the-shelf VQA model and an off-the-shelf visual grounding (VG) model, where the LLM transforms and conveys textual information between the two modules based on a designed prompt. LCV2 establishes an integrated plug-and-play framework without the need for any pre-training process, and it can be deployed for VQA grounding tasks under low computational resources. Because the framework is modular, it can be instantiated with various state-of-the-art pre-trained models and is well positioned to benefit from future improvements in those components. Experiments were conducted under constrained computational and memory resources, evaluating the proposed method's performance on benchmark datasets including GQA, CLEVR, and VizWiz-VQA-Grounding. Comparative analyses with baseline methods demonstrate the robust competitiveness of LCV2.
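For concreteness, the modular pipeline the abstract describes could be sketched as below. The wrapper interfaces (`vqa_model.answer`, `llm.generate`, `vg_model.ground`) and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an LCV2-style modular pipeline: off-the-shelf VQA model ->
# frozen LLM rewriting -> off-the-shelf visual grounding model. The model
# wrappers and prompt text are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float

def grounded_vqa(image, question, vqa_model, llm, vg_model) -> tuple[str, BoundingBox]:
    """Answer a question and ground the answer in the image, with no training."""
    # 1) Off-the-shelf VQA model produces a free-form textual answer.
    answer = vqa_model.answer(image, question)

    # 2) A frozen LLM rewrites (question, answer) into a short referring phrase
    #    that a grounding model can localize. The instruction below is only a
    #    guess at what the paper's "designed prompt" might contain.
    prompt = (
        "Given the question and its answer, output a short noun phrase that "
        f"describes the object being referred to.\nQuestion: {question}\n"
        f"Answer: {answer}\nReferring phrase:"
    )
    phrase = llm.generate(prompt).strip()

    # 3) Off-the-shelf visual grounding model localizes the phrase.
    box = vg_model.ground(image, phrase)
    return answer, box
```

Because every component is frozen and off-the-shelf, swapping in a stronger VQA, LLM, or grounding model requires no retraining, which is the plug-and-play property the abstract emphasizes.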
Related papers
- Reinforcement Tuning for Detecting Stances and Debunking Rumors Jointly with Large Language Models [13.356554246394692]
Large language models (LLMs) serve as foundation annotators for the joint stance detection (SD) and rumor verification (RV) tasks, dubbed JSDRV.
We introduce a novel reinforcement tuning framework to enhance the joint predictive capabilities of LLM-based SD and RV components.
Results demonstrate that JSDRV improves the capabilities of LLMs in the joint tasks, not only outperforming state-of-the-art methods but also generalizing to non-LLMs accommodated as task models.
arXiv Detail & Related papers (2024-06-04T09:31:18Z) - Memory-guided Network with Uncertainty-based Feature Augmentation for Few-shot Semantic Segmentation [12.653336728447654]
We propose a class-shared memory (CSM) module consisting of a set of learnable memory vectors.
These memory vectors learn elemental object patterns from base classes during training whilst re-encoding query features during both training and inference.
We integrate CSM and uncertainty-based feature augmentation (UFA) into representative FSS works, with experimental results on the widely used PASCAL-5$^i$ and COCO-20$^i$ datasets.
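A rough sketch of how a set of learnable, class-shared memory vectors could re-encode query features; the cross-attention form, residual connection, and dimensions here are assumptions rather than the paper's exact design.

```python
# Illustrative class-shared memory (CSM) style module: query features attend
# over a bank of learnable memory vectors and are re-encoded residually.
import torch
import torch.nn as nn

class ClassSharedMemory(nn.Module):
    def __init__(self, num_vectors: int = 16, dim: int = 256):
        super().__init__()
        # Learnable memory vectors shared across classes.
        self.memory = nn.Parameter(torch.randn(num_vectors, dim))
        self.scale = dim ** -0.5

    def forward(self, query_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (B, HW, dim) flattened spatial features of the query image.
        attn = torch.softmax(query_feats @ self.memory.t() * self.scale, dim=-1)
        recoded = attn @ self.memory          # (B, HW, dim)
        return query_feats + recoded          # residual re-encoding
```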
arXiv Detail & Related papers (2024-06-01T19:53:25Z) - Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models [34.37450315995176]
Current RVOS methods typically use vision and language models pre-trained independently as backbones.
We propose a temporal-aware prompt-tuning method, which adapts pre-trained representations for pixel-level prediction.
Our method outperforms state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - ES-MVSNet: Efficient Framework for End-to-end Self-supervised Multi-View Stereo [11.41432976633312]
In this work, we propose an efficient framework for end-to-end self-supervised MVS, dubbed ES-MVSNet.
To alleviate the high memory consumption of current E2E self-supervised MVS frameworks, we present a memory-efficient architecture that reduces memory usage by 43% without compromising model performance.
With the novel design of asymmetric view selection policy and region-aware depth consistency, we achieve state-of-the-art performance among E2E self-supervised MVS methods, without relying on third-party models for additional consistency signals.
arXiv Detail & Related papers (2023-08-04T08:16:47Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
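As an illustration of the idea of a probabilistic adapter over frozen embeddings, the sketch below assumes a simple diagonal-Gaussian parameterization; ProbVLM's actual distribution family and training objective may differ.

```python
# Illustrative probabilistic adapter: maps a frozen deterministic embedding to
# a mean and per-dimension variance, i.e. a distribution over embeddings.
import torch
import torch.nn as nn

class ProbabilisticAdapter(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mu_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.logvar_head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, frozen_embedding: torch.Tensor):
        # frozen_embedding: (B, dim) output of a frozen CLIP-style encoder.
        mu = self.mu_head(frozen_embedding)
        logvar = self.logvar_head(frozen_embedding)   # per-dimension uncertainty
        return mu, logvar.exp()
```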
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose a language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA).
Our approach employs the captions generated by a captioning model as the context of an answer prediction model, which is a pre-trained language model (PLM).
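A minimal sketch of caption-as-context zero-shot VQA in this spirit; the model wrappers and prompt are hypothetical, and the language-model feedback step that gives LAMOC its name is omitted.

```python
# Illustrative caption-as-context VQA: a captioning model describes the image
# and a pre-trained language model answers the question from that description.
def zero_shot_vqa(image, question, caption_model, plm, num_captions: int = 5) -> str:
    captions = [caption_model.generate(image) for _ in range(num_captions)]
    context = " ".join(captions)
    prompt = (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return plm.generate(prompt).strip()
```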
arXiv Detail & Related papers (2023-05-26T15:04:20Z) - Switchable Representation Learning Framework with Self-compatibility [50.48336074436792]
We propose a Switchable representation learning Framework with Self-Compatibility (SFSC).
SFSC generates a series of compatible sub-models with different capacities through one training process.
SFSC achieves state-of-the-art performance on the evaluated datasets.
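One plausible way to derive sub-models of different capacities from a single training run is slimmable-style width slicing, sketched below purely as an illustration; it is not claimed to be SFSC's actual mechanism for self-compatibility.

```python
# Illustrative "one training run, many capacities" layer: smaller sub-models
# use a channel prefix of the shared weights, so their features live inside
# the full model's feature space.
import torch
import torch.nn as nn

class SlicedLinear(nn.Module):
    def __init__(self, in_dim: int = 512, out_dim: int = 128):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        # Keep only the first `width` fraction of input/output channels.
        out_c = int(self.weight.shape[0] * width)
        in_c = int(self.weight.shape[1] * width)
        return x[:, :in_c] @ self.weight[:out_c, :in_c].t()
```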
arXiv Detail & Related papers (2022-06-16T16:46:32Z) - Transfer Learning in Multi-Agent Reinforcement Learning with Double Q-Networks for Distributed Resource Sharing in V2X Communication [24.442174952832108]
This paper addresses the problem of decentralized spectrum sharing in vehicle-to-everything (V2X) communication networks.
The aim is to provide resource-efficient coexistence of vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) links.
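For reference, the double Q-learning target that Double DQN agents optimize is sketched below; the V2X-specific state, action, and reward design of the paper is not reproduced.

```python
# Generic Double DQN target: the online network selects the greedy next action,
# the target network evaluates it, reducing Q-value overestimation.
import torch

def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    with torch.no_grad():
        next_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```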
arXiv Detail & Related papers (2021-07-13T15:50:10Z) - WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources.
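A condensed sketch of a MoCo-style cross-modal queue: image-tower queries are contrasted against one momentum-encoded positive text embedding and a large queue of negatives. The shapes, the momentum encoder itself, and the training loop are assumptions; BriVL's full two-tower design is more involved.

```python
# Illustrative queue-based dictionary for cross-modal contrastive learning.
import torch
import torch.nn.functional as F

class CrossModalQueue:
    def __init__(self, dim: int = 256, size: int = 65536):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)  # negative text keys
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys: torch.Tensor):
        n = keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = int((self.ptr + n) % self.queue.shape[0])

def contrastive_loss(img_emb, txt_emb_momentum, queue: CrossModalQueue, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=1)                     # queries (image tower)
    txt_emb_momentum = F.normalize(txt_emb_momentum, dim=1)   # keys (momentum text tower)
    pos = (img_emb * txt_emb_momentum).sum(dim=1, keepdim=True)  # (B, 1) positive logits
    neg = img_emb @ queue.queue.t()                              # (B, K) negatives from queue
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(img_emb.shape[0], dtype=torch.long)     # positive is index 0
    loss = F.cross_entropy(logits, labels)
    queue.enqueue(txt_emb_momentum)                              # refresh the dictionary
    return loss
```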
arXiv Detail & Related papers (2021-03-11T09:39:49Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach performs close to a model pre-trained with aligned data on four English V&L benchmarks.
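A toy sketch of what mask-and-predict pre-training on unimodal corpora could look like: one shared encoder alternately reconstructs masked text tokens and masked image-region features. The module names, vocabulary size, and prediction heads are illustrative assumptions.

```python
# Toy sketch: a shared Transformer encoder with separate prediction heads for
# masked text tokens and masked image-region features.
import torch
import torch.nn as nn

class SharedMaskAndPredictEncoder(nn.Module):
    def __init__(self, dim: int = 256, vocab_size: int = 30522):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(dim, vocab_size)  # classify masked token ids
        self.region_head = nn.Linear(dim, dim)       # regress masked region features

    def forward(self, embeddings: torch.Tensor, modality: str) -> torch.Tensor:
        # embeddings: (B, L, dim) masked text embeddings or masked region features.
        hidden = self.encoder(embeddings)
        return self.text_head(hidden) if modality == "text" else self.region_head(hidden)
```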
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.