Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models
- URL: http://arxiv.org/abs/2501.14189v1
- Date: Fri, 24 Jan 2025 02:50:21 GMT
- Title: Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models
- Authors: Saaduddin Mahmud, Dorian Benhamou Goldfajn, Shlomo Zilberstein,
- Abstract summary: Distributed Constraint Optimization Problems (DCOPs) offer a powerful framework for multi-agent coordination but often rely on labor-intensive, manual problem construction.<n>We introduce a framework that takes advantage of large multimodal foundation models (LFMs) to automatically generate constraints from both visual and linguistic instructions.<n>We evaluate these agent archetypes using state-of-the-art LLMs (large language models) and VLMs (vision language models) on three novel VL-DCOP tasks and compare their respective advantages and drawbacks.
- Score: 9.37268652939886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed Constraint Optimization Problems (DCOPs) offer a powerful framework for multi-agent coordination but often rely on labor-intensive, manual problem construction. To address this, we introduce VL-DCOPs, a framework that takes advantage of large multimodal foundation models (LFMs) to automatically generate constraints from both visual and linguistic instructions. We then introduce a spectrum of agent archetypes for solving VL-DCOPs: from a neuro-symbolic agent that delegates some of the algorithmic decisions to an LFM, to a fully neural agent that depends entirely on an LFM for coordination. We evaluate these agent archetypes using state-of-the-art LLMs (large language models) and VLMs (vision language models) on three novel VL-DCOP tasks and compare their respective advantages and drawbacks. Lastly, we discuss how this work extends to broader frontier challenges in the DCOP literature.
Related papers
- Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images.<n>Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent.<n> Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z) - Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs.<n>Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks.<n>We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning? [4.429189958406034]
Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL)
We propose a novel algorithm, textbfQLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs)
Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-17T14:07:11Z) - A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research.
multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks.
This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems.
arXiv Detail & Related papers (2025-02-22T20:55:26Z) - MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [52.739500459903724]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation.
We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents.
We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z) - Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [64.1799100754406]
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more.
Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines still remain inadequately explored in vision-language tasks.
We present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) an effective training pipeline to enhance the reasoning capabilities of MLLMs.
arXiv Detail & Related papers (2024-11-21T18:59:55Z) - MaCTG: Multi-Agent Collaborative Thought Graph for Automatic Programming [10.461509044478278]
MaCTG (MultiAgent Collaborative Thought Graph) is a novel multi-agent framework that employs a dynamic graph structure.
It autonomously assigns agent roles based on programming requirements, dynamically refines task distribution, and systematically verifies and integrates project-level code.
MaCTG significantly reduced operational costs by 89.09% compared to existing multi-agent frameworks.
arXiv Detail & Related papers (2024-10-25T01:52:15Z) - Large Multimodal Agents: A Survey [78.81459893884737]
Large language models (LLMs) have achieved superior performance in powering text-based AI agents.
There is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain.
This review aims to provide valuable insights and guidelines for future research in this rapidly evolving field.
arXiv Detail & Related papers (2024-02-23T06:04:23Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - Controlling Large Language Model-based Agents for Large-Scale
Decision-Making: An Actor-Critic Approach [28.477463632107558]
We develop a modular framework called LLaMAC to address hallucination in Large Language Models and coordination in Multi-Agent Systems.
LLaMAC implements a value distribution encoding similar to that found in the human brain, utilizing internal and external feedback mechanisms to facilitate collaboration and iterative reasoning among its modules.
arXiv Detail & Related papers (2023-11-23T10:14:58Z) - Theory of Mind for Multi-Agent Collaboration via Large Language Models [5.2767999863286645]
This study evaluates Large Language Models (LLMs)-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks.
We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents.
arXiv Detail & Related papers (2023-10-16T07:51:19Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration [83.4031923134958]
Corex is a suite of novel general-purpose strategies that transform Large Language Models into autonomous agents.
Inspired by human behaviors, Corex is constituted by diverse collaboration paradigms including Debate, Review, and Retrieve modes.
We demonstrate that orchestrating multiple LLMs to work in concert yields substantially better performance compared to existing methods.
arXiv Detail & Related papers (2023-09-30T07:11:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.