Improving Generalization in Visual Reasoning via Self-Ensemble
- URL: http://arxiv.org/abs/2410.20883v2
- Date: Fri, 01 Nov 2024 12:42:49 GMT
- Title: Improving Generalization in Visual Reasoning via Self-Ensemble
- Authors: Tien-Huy Nguyen, Quang-Khai Tran, Anh-Tuan Quang-Hoang,
- Abstract summary: We propose self-ensemble, a novel method that improves the generalization and visual reasoning of the model without updating any parameters.
Our key insight is that LVLM itself can ensemble without the need for any other LVLMs, which helps to unlock their internal capabilities.
- Score: 0.0
- License:
- Abstract: The cognitive faculty of visual reasoning necessitates the integration of multimodal perceptual processing and commonsense and external knowledge of the world. In recent years, a plethora of large vision-language models (LVLMs) have been proposed, demonstrating outstanding power and exceptional proficiency in commonsense reasoning across diverse domains and tasks. Nevertheless, training such LVLMs requires a lot of costly resources. Recent approaches, instead of training LVLMs from scratch on various large datasets, focus on exploring ways to take advantage of the capabilities of many different LVLMs, such as ensemble methods. In this work, we propose self-ensemble, a novel method that improves the generalization and visual reasoning of the model without updating any parameters, a training-free method. Our key insight is that we realized that LVLM itself can ensemble without the need for any other LVLMs, which helps to unlock their internal capabilities. Extensive experiments on various benchmarks demonstrate the effectiveness of our method in achieving state-of-the-art (SOTA) performance on SketchyVQA, Outside Knowledge VQA, and out-of-distribution VQA tasks.
Related papers
- Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering [44.008094698200026]
We introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources.
Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge.
This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.
arXiv Detail & Related papers (2024-11-25T19:01:03Z) - Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? [6.7065734065794835]
We introduce a novel learning paradigm termed MLLM4WTAL.
It harnesses the potential of MLLM to offer temporal action key semantics and complete semantic priors.
It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR)
arXiv Detail & Related papers (2024-11-13T09:37:24Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues.
Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images.
This paper systematically explore the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z) - Vision-Language Models Provide Promptable Representations for Reinforcement Learning [67.40524195671479]
We propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied reinforcement learning (RL)
We show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
arXiv Detail & Related papers (2024-02-05T00:48:56Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration [83.4031923134958]
Corex is a suite of novel general-purpose strategies that transform Large Language Models into autonomous agents.
Inspired by human behaviors, Corex is constituted by diverse collaboration paradigms including Debate, Review, and Retrieve modes.
We demonstrate that orchestrating multiple LLMs to work in concert yields substantially better performance compared to existing methods.
arXiv Detail & Related papers (2023-09-30T07:11:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.