Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts
- URL: http://arxiv.org/abs/2601.09746v1
- Date: Sun, 11 Jan 2026 20:36:47 GMT
- Title: Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts
- Authors: Philip Xu, Isabel Wagner, Eerke Boiten
- Abstract summary: This paper introduces a novel Multi-Agent Cooperative Learning framework to address cross-modal alignment collapse in vision-language models. Experiments on the VISTA-Beyond dataset demonstrate that MACL significantly improves performance in both few-shot and zero-shot settings.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a novel Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents (image, text, name, and coordination) collaboratively mitigate modality imbalance through structured message passing. The framework enables multi-agent feature-space name learning, incorporates a context-exchange-enhanced few-shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter-agent contributions. Experiments on the VISTA-Beyond dataset demonstrate that MACL significantly improves performance in both few-shot and zero-shot settings, achieving 1-5% precision gains across diverse visual domains.
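The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of what structured message passing with adaptive dynamic balancing among four agents could look like. Every identifier here (Agent, adaptive_weights, the cosine-softmax weighting, the fixed 0.7/0.3 blend) is an illustrative assumption; the paper publishes no code and its actual update rules are unknown.

```python
# Hypothetical sketch of four-agent message passing; not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # shared feature dimension (assumed)

class Agent:
    """One modality-specific agent holding a feature state and a
    projection applied when reading messages from its peers."""
    def __init__(self, name: str):
        self.name = name
        self.state = rng.normal(size=DIM)
        self.read_proj = rng.normal(scale=0.1, size=(DIM, DIM))

    def message(self) -> np.ndarray:
        # Broadcast the current state as this agent's outgoing message.
        return self.state

    def update(self, messages, weights) -> None:
        # Fuse peer messages under the adaptive weights, then blend the
        # fused signal into the agent's own state.
        fused = sum(w * (self.read_proj @ m) for w, m in zip(weights, messages))
        self.state = 0.7 * self.state + 0.3 * np.tanh(fused)

def adaptive_weights(messages, anchor):
    # One plausible "adaptive dynamic balancing": softmax over cosine
    # similarity to the coordination agent's state, so that no single
    # modality dominates the fused message.
    sims = np.array([
        m @ anchor / (np.linalg.norm(m) * np.linalg.norm(anchor) + 1e-8)
        for m in messages
    ])
    exp = np.exp(sims - sims.max())
    return exp / exp.sum()

agents = {k: Agent(k) for k in ("image", "text", "name", "coordination")}

for _ in range(3):  # a few structured message-passing rounds
    anchor = agents["coordination"].state.copy()
    for receiver in agents.values():
        peers = [a.message() for a in agents.values() if a is not receiver]
        receiver.update(peers, adaptive_weights(peers, anchor))

print({k: round(float(np.linalg.norm(a.state)), 3) for k, a in agents.items()})
```

In a trained system the projections and balancing weights would be learned end-to-end rather than fixed; the sketch only shows the data flow: each agent broadcasts a message, and the coordination agent's state anchors how strongly each modality is weighted.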
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
- Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings [10.36125908359289]
We present a novel model-based multi-agent reinforcement learning framework. We design a world model trained with variational auto-encoders and augment it with state-action learned embeddings (SALE). By coupling imagined trajectories with SALE-based action values, the agents acquire a richer understanding of how their choices influence collective outcomes.
arXiv Detail & Related papers (2026-02-13T01:57:21Z)
- Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception [39.37877716254272]
This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration. Four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2026-01-30T21:44:33Z)
- Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding [53.18433310890516]
Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings. We propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning.
arXiv Detail & Related papers (2025-11-11T17:23:02Z)
- Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems [0.8437187555622164]
Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static, fixed roles and limited inter-agent communication. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms.
arXiv Detail & Related papers (2025-07-22T22:42:51Z)
- Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
- Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align large language models. Controlled Decoding provides a mechanism for aligning a model at inference time without retraining. We propose a mixture of agent-based decoding strategies leveraging existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z)
- Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence [83.15764564701706]
We propose a novel framework that performs vision-language alignment by integrating the Cauchy-Schwarz (CS) divergence with mutual information. We find that the CS divergence seamlessly resolves InfoNCE's alignment-uniformity conflict and plays a role complementary to InfoNCE; the divergence itself is written out after this list. Experiments on text-to-image generation and cross-modal retrieval tasks demonstrate the effectiveness of our method for vision-language alignment.
arXiv Detail & Related papers (2025-02-24T10:29:15Z)
- Cooperative Multi-Agent Planning with Adaptive Skill Synthesis [16.228784877899976]
We present a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. We demonstrate its strong performance against state-of-the-art MARL baselines across both symmetric and asymmetric scenarios.
arXiv Detail & Related papers (2025-02-14T13:23:18Z)
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
- CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models [8.793510432881803]
We introduce the TinyAgent model, trained on a meticulously curated high-quality dataset. We also present the Collaborative Multi-Agent Tuning (CMAT) framework, an innovative system designed to augment language agent capabilities. In this research, we propose a new communication agent framework that integrates multi-agent systems with environmental feedback mechanisms.
arXiv Detail & Related papers (2024-04-02T06:07:35Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
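The Cauchy-Schwarz divergence named in the distributional alignment entry above has a standard closed form. The following is the textbook definition for densities $p$ and $q$, not notation taken from that paper:

$$
D_{\mathrm{CS}}(p \,\|\, q) = -\log \frac{\left( \int p(x)\, q(x)\, dx \right)^{2}}{\int p(x)^{2}\, dx \,\int q(x)^{2}\, dx}
$$

By the Cauchy-Schwarz inequality this quantity is nonnegative and vanishes exactly when $p = q$, which is what makes it usable as a distribution-level alignment objective between image and text embeddings.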
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.