See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
- URL: http://arxiv.org/abs/2301.05226v1
- Date: Thu, 12 Jan 2023 18:59:50 GMT
- Title: See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
- Authors: Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, Chuang Gan
- Abstract summary: We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
- Score: 60.43585179885355
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large pre-trained vision and language models have demonstrated remarkable capabilities on a variety of tasks. However, knowledge-based visual reasoning remains challenging: it requires a model to comprehensively understand image content, connect it to external world knowledge, and perform step-by-step reasoning to answer questions correctly. To this end, we propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages: see, think, and confirm. The see stage scans the image and grounds visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to adaptively attend to the key concepts among the candidates, transforms them into text context with a visual captioning model, and prompts the LLM to generate the answer. The confirm stage further uses the LLM to generate a supporting rationale for the answer, verifies the rationale with a cross-modality classifier, and ensures that the rationale consistently infers the predicted answer. We conduct experiments on a range of knowledge-based visual reasoning datasets and find that IPVR offers several benefits: (1) it achieves better performance than previous few-shot learning baselines; (2) it makes the whole reasoning process transparent and trustworthy by providing a rationale for each reasoning step; and (3) it is computation-efficient compared with fine-tuning baselines.
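The abstract's see-think-confirm loop can be condensed into a short sketch. The code below is a minimal illustration under stated assumptions, not the authors' released implementation: detect_concepts, caption_concepts, llm_complete, and verify_rationale are hypothetical placeholders for the perception model, visual captioning model, pre-trained LLM, and cross-modality classifier, and the rationale-consistency check is simplified to a single boolean verification per round. The stubs only keep the control flow visible; a real system would call actual models with few-shot prompts.

# Minimal sketch of the see / think / confirm loop described in the abstract.
# All model calls below are hypothetical stubs, not part of any released IPVR code.

def detect_concepts(image):
    """See: ground visual concept candidates with a perception model (stub)."""
    return ["person", "umbrella", "rain"]

def caption_concepts(image, concepts):
    """Turn the attended concepts into text context via a captioning model (stub)."""
    return "A photo containing " + ", ".join(concepts) + "."

def llm_complete(prompt):
    """Query a pre-trained LLM with a few-shot prompt (stub)."""
    return "to stay dry in the rain"

def verify_rationale(image, answer, rationale):
    """Confirm: cross-modality check that the rationale supports the answer (stub)."""
    return True

def ipvr_answer(image, question, max_rounds=3):
    # See: scan the image once and collect visual concept candidates.
    candidates = detect_concepts(image)
    answer = rationale = None
    for _ in range(max_rounds):
        # Think: ask the LLM which candidate concepts matter for this question,
        # turn them into text context, then generate a candidate answer.
        key_concepts = llm_complete(
            f"Question: {question}\nCandidates: {', '.join(candidates)}\nKey concepts:"
        ).split(", ")
        context = caption_concepts(image, key_concepts)
        answer = llm_complete(f"Context: {context}\nQuestion: {question}\nAnswer:")
        # Confirm: generate a supporting rationale and verify it; accept the
        # answer only if the rationale passes the cross-modality check.
        rationale = llm_complete(
            f"Context: {context}\nQuestion: {question}\nAnswer: {answer}\nRationale:"
        )
        if verify_rationale(image, answer, rationale):
            break
    return answer, rationale

if __name__ == "__main__":
    print(ipvr_answer(None, "Why is the person holding an umbrella?"))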
Related papers
- ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [42.03770972100087]
We introduce a novel visual reasoning framework named ProReason.
ProReason features multi-run proactive perception and decoupled vision-reasoning capabilities.
Our experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods.
arXiv Detail & Related papers (2024-10-18T03:22:06Z)
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs [83.24033574914425]
We present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving.
Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information.
Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.
arXiv Detail & Related papers (2024-06-20T17:54:03Z)
- Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue [34.223466503256766]
We provide a new paradigm of constructing multimodal dialogues by splitting visual knowledge into finer granularity.
To boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset.
By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts.
arXiv Detail & Related papers (2023-05-23T02:08:56Z)
- Enhance Reasoning Ability of Visual-Language Models via Large Language Models [7.283533791778359]
We propose a method called TReE, which transfers the reasoning ability of a large language model to a visual language model in zero-shot scenarios.
TReE contains three stages: observation, thinking, and re-thinking.
arXiv Detail & Related papers (2023-05-22T17:33:44Z)
- Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z)
- KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning [4.787501955202053]
In the visual commonsense reasoning (VCR) task, a machine must answer a question correctly and then provide a rationale justifying its answer.
We propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model.
Besides taking visual and linguistic contents as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer.
arXiv Detail & Related papers (2020-12-13T08:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.