Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents
- URL: http://arxiv.org/abs/2603.01416v1
- Date: Mon, 02 Mar 2026 03:43:31 GMT
- Title: Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents
- Authors: Zhixiang Wang, Jingxuan Xu, Dajun Chen, Yunfang Wu, Wei Jiang, Yong Li,
- Abstract summary: We propose a training-free paradigm to empower Vision-Language Models with autonomous search capabilities.<n>By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data.
- Score: 20.119608534884858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
Related papers
- VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning [22.27364585438247]
VSearcher is a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments.<n>We introduce Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions.<n>We then adopt an SFT-then-RL training pipeline to turn base multimodal models to agent capable of multi-turn tool calling in real-world web environments.
arXiv Detail & Related papers (2026-03-03T09:33:22Z) - M$^3$Searcher: Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning [8.546005018618713]
M$3$Searcher is a modular multimodal information-seeking agent.<n>M$3$Searcher is optimized with a retrieval-oriented multi-objective reward.<n> MMSearchVQA is a multimodal multi-hop dataset to support retrieval centric RL training.
arXiv Detail & Related papers (2026-01-14T08:27:40Z) - Beyond Monolithic Architectures: A Multi-Agent Search and Knowledge Optimization Framework for Agentic Search [56.78490647843876]
Agentic search has emerged as a promising paradigm for complex information seeking by enabling Large Language Models (LLMs) to interleave reasoning with tool use.<n>We propose bfM-ASK, a framework that explicitly decouples agentic search into two complementary roles: Search Behavior Agents, which plan and execute search actions, and Knowledge Management Agents, which aggregate, filter, and maintain a compact internal context.
arXiv Detail & Related papers (2026-01-08T08:13:27Z) - SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning [57.083359974905655]
SenseNova-MARS is a novel Multimodal Agentic Reasoning and Search framework.<n>It dynamically integrates the image search, text search, and image crop tools to tackle knowledge-intensive visual understanding challenges.<n> SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks.
arXiv Detail & Related papers (2025-12-30T16:31:45Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks.<n>Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions.<n>We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - MMSearch-R1: Incentivizing LMMs to Search [49.889749277236376]
We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables on-demand, multi-turn search in real-world Internet environments.<n>Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty.
arXiv Detail & Related papers (2025-06-25T17:59:42Z) - OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging [124.91183814854126]
Model merging seeks to combine multiple expert models into a single model.<n>We introduce a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation.<n>We find that model merging offers a promising way for building improved MLLMs without requiring training data.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - ZeroSearch: Incentivize the Search Capability of LLMs without Searching [69.55482019211597]
We introduce ZeroSearch, a framework that incentivizes the capabilities of large language models to use a real search engine with simulated searches during training.<n>Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents.
arXiv Detail & Related papers (2025-05-07T17:30:22Z) - Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for large language models (MLLMs)<n>We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs.<n>We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.