IdealGPT: Iteratively Decomposing Vision and Language Reasoning via
Large Language Models
- URL: http://arxiv.org/abs/2305.14985v1
- Date: Wed, 24 May 2023 10:19:57 GMT
- Title: IdealGPT: Iteratively Decomposing Vision and Language Reasoning via
Large Language Models
- Authors: Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A.
Ayyubi, Kai-Wei Chang, Shih-Fu Chang
- Abstract summary: We develop a framework that decomposes vision-and-language (VL) reasoning using large language models (LLMs).
In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE.
- Score: 77.0577928874177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of vision-and-language (VL) understanding has made unprecedented
progress with end-to-end large pre-trained VL models (VLMs). However, they
still fall short in zero-shot reasoning tasks that require multi-step
inference. To bridge this gap, previous works resort to a
divide-and-conquer pipeline. In this paper, we argue that previous efforts have
several inherent shortcomings: 1) They rely on domain-specific sub-question
decomposing models. 2) They force models to predict the final answer even if
the sub-questions or sub-answers provide insufficient information. We address
these limitations via IdealGPT, a framework that iteratively decomposes VL
reasoning using large language models (LLMs). Specifically, IdealGPT utilizes
an LLM to generate sub-questions, a VLM to provide corresponding sub-answers,
and another LLM to reason to achieve the final answer. These three modules
perform the divide-and-conquer procedure iteratively until the model is
confident about the final answer to the main question. We evaluate IdealGPT on
multiple challenging VL reasoning tasks under a zero-shot setting. In
particular, our IdealGPT outperforms the best existing GPT-4-like models by an
absolute 10% on VCR and 15% on SNLI-VE. Code is available at
https://github.com/Hxyou/IdealGPT
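The three-module loop described in the abstract can be sketched as follows. This is a minimal illustration of the iterative divide-and-conquer idea only; the function names and the toy questioner/answerer/reasoner stubs are hypothetical stand-ins, not the released implementation.

```python
# Sketch of an IdealGPT-style loop: an LLM proposes sub-questions, a VLM
# answers them against the image, and a reasoner LLM either commits to a
# final answer or requests another round. All three model calls are
# injected as callables; toy stubs below make the loop runnable.

def ideal_gpt_loop(main_question, image, questioner, answerer, reasoner,
                   max_rounds=4):
    """Iterate divide-and-conquer until the reasoner is confident."""
    evidence = []          # accumulated (sub-question, sub-answer) pairs
    answer = None
    for _ in range(max_rounds):
        # 1) LLM decomposes the main question, conditioned on evidence so far.
        sub_questions = questioner(main_question, evidence)
        # 2) VLM grounds each sub-question in the image.
        for sq in sub_questions:
            evidence.append((sq, answerer(image, sq)))
        # 3) Reasoner LLM tries to conclude; low confidence triggers a new round.
        answer, confident = reasoner(main_question, evidence)
        if confident:
            return answer, evidence
    return answer, evidence  # best effort after max_rounds

# Toy stand-ins so the loop runs end to end ("image" is just a dict here).
def toy_questioner(q, evidence):
    return ["What object is shown?"] if not evidence else ["What color is it?"]

def toy_answerer(image, sq):
    return image.get(sq.split()[1], "unknown")

def toy_reasoner(q, evidence):
    # Pretend to be confident once two pieces of evidence are collected.
    if len(evidence) >= 2:
        return "a red apple", True
    return None, False

image = {"object": "apple", "color": "red"}
answer, evidence = ideal_gpt_loop("What fruit is shown?", image,
                                  toy_questioner, toy_answerer, toy_reasoner)
```

With the toy stubs, the loop takes two rounds before the reasoner commits, mirroring how the real system keeps decomposing until the sub-answers provide sufficient information rather than forcing a premature final answer.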
Related papers
- VaPR -- Vision-language Preference alignment for Reasoning [43.4847999322297]
We introduce a hard-negative response generation framework based on LLM-guided response editing. VaPR produces rejected responses with targeted errors, maintaining stylistic and length similarity to the accepted ones. We show that VaPR generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving 99% of the performance of models trained on name.
arXiv Detail & Related papers (2025-10-02T06:10:43Z)
- Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context [0.16385815610837165]
Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, for histopathology image classification tasks.
arXiv Detail & Related papers (2025-06-15T01:50:16Z) - Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model [33.546564412022754]
MoqaGPT is a framework for multi-modal open-domain question answering.
It retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer.
On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods.
arXiv Detail & Related papers (2023-10-20T04:09:36Z)
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models [56.51705482912727]
We present RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting.
Experimental results on the TREC 2019 and 2020 Deep Learning Tracks show that we can achieve effectiveness comparable to zero-shot reranking with GPT-3.5 with a much smaller 7B parameter model, although our effectiveness remains slightly behind reranking with GPT-4.
arXiv Detail & Related papers (2023-09-26T17:31:57Z)
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z)
- Going Beyond Nouns With Vision & Language Models Using Synthetic Data [43.87754926411406]
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models.
We investigate to which extent purely synthetic data could be leveraged to teach these models to overcome such shortcomings.
arXiv Detail & Related papers (2023-03-30T17:57:43Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain-of-thought (CoT) prompting require multi-step prompting and multi-token prediction, making them highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- Proper Value Equivalence [37.565244088924906]
We argue that popular algorithms such as MuZero and Muesli can be understood as minimizing an upper bound for this loss.
We propose a modification to MuZero and show that it can lead to improved performance in practice.
arXiv Detail & Related papers (2021-06-18T19:05:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.