Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering
- URL: http://arxiv.org/abs/2412.18351v1
- Date: Tue, 24 Dec 2024 11:24:56 GMT
- Title: Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering
- Authors: Zhongjian Hu, Peng Yang, Bing Li, Zhenqi Wang
- Abstract summary: We propose a voting framework for knowledge-based Visual Question Answering.
We design three agents that simulate different levels of staff in a team, and assign the available tools according to the levels.
Experiments on OK-VQA and A-OKVQA show that our approach outperforms other baselines by 2.2 and 1.0, respectively.
- Score: 6.6897007888321465
- Abstract: Large Language Models (LLMs) have achieved impressive results in knowledge-based Visual Question Answering (VQA). However, existing methods still face two challenges: the inability to use external tools autonomously, and the inability to work in teams. Humans tend to know whether they need external tools when they encounter a new question: they can usually answer a familiar question directly, whereas they turn to tools such as search engines for an unfamiliar one. Humans also tend to collaborate and discuss with others to reach better answers. Inspired by this, we propose a multi-agent voting framework. We design three LLM-based agents that simulate different levels of staff in a team and assign the available tools according to those levels. Each agent provides its own answer, and a vote over all the agents' answers determines the final answer. Experiments on OK-VQA and A-OKVQA show that our approach outperforms other baselines by 2.2 and 1.0, respectively.
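The voting step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three agent functions are hypothetical stubs standing in for the paper's LLM-based, tool-equipped agents at different staff levels, and the tie-breaking rule is an assumption.

```python
from collections import Counter

# Hypothetical stand-ins for the paper's three agents. In the actual
# framework each is an LLM with a level-appropriate set of tools;
# here they simply return candidate answer strings.
def junior_agent(question: str) -> str:
    return "umbrella"

def senior_agent(question: str) -> str:
    return "umbrella"

def manager_agent(question: str) -> str:
    return "raincoat"

def vote(answers: list[str]) -> str:
    """Majority vote over the agents' answers. On a tie, the answer
    encountered first in the list wins (Counter preserves insertion
    order); the paper may resolve ties differently."""
    return Counter(answers).most_common(1)[0][0]

def answer_question(question: str) -> str:
    # Collect one answer per agent, then vote to pick the final one.
    answers = [agent(question) for agent in
               (junior_agent, senior_agent, manager_agent)]
    return vote(answers)

print(answer_question("What might the person carry on a rainy day?"))
# -> umbrella (2 of 3 agents agree)
```

With real agents, the stubs would each call an LLM (the junior agent answering directly, the higher levels invoking tools such as search) before the same vote is taken.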
Related papers
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z) - Multimodal Reranking for Knowledge-Intensive Visual Question Answering [77.24401833951096]
We introduce a multi-modal reranker to improve the ranking quality of knowledge candidates for answer generation.
Experiments on OK-VQA and A-OKVQA show that multi-modal reranker from distant supervision provides consistent improvements.
arXiv Detail & Related papers (2024-07-17T02:58:52Z) - StackRAG Agent: Improving Developer Answers with Retrieval-Augmented Generation [2.225268436173329]
StackRAG is a retrieval-augmented Multiagent generation tool based on Large Language Models.
It aggregates knowledge from Stack Overflow (SO) to enhance the reliability of the generated answers.
Initial evaluations show that the generated answers are correct, accurate, relevant, and useful.
arXiv Detail & Related papers (2024-06-19T21:07:35Z) - Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z) - Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations [70.6395572287422]
Self-alignment method is capable of not only refusing to answer but also providing explanation to the unanswerability of unknown questions.
We conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself for aligning the responses to unknown questions as desired.
arXiv Detail & Related papers (2024-02-23T02:24:36Z) - One Agent Too Many: User Perspectives on Approaches to Multi-agent Conversational AI [10.825570464035872]
We show that users significantly prefer abstracted agent orchestration in terms of both system usability and system performance.
We demonstrate that this mode of interaction is able to provide quality responses that are rated within 1% of human-selected answers.
arXiv Detail & Related papers (2024-01-13T17:30:57Z) - Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering [47.668572102657684]
This work introduces a novel, explainable multi-agent collaboration framework by leveraging the expansive knowledge of Large Language Models (LLMs) to enhance the capabilities of Vision Language Models (VLMs).
arXiv Detail & Related papers (2023-11-29T03:10:42Z) - Asking for Knowledge: Training RL Agents to Query External Knowledge Using Language [121.56329458876655]
We introduce two new environments: the grid-world-based Q-BabyAI and the text-based Q-TextWorld.
We propose the "Asking for Knowledge" (AFK) agent, which learns to generate language commands to query for meaningful knowledge.
arXiv Detail & Related papers (2022-05-12T14:20:31Z) - MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.