Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM
- URL: http://arxiv.org/abs/2508.07260v1
- Date: Sun, 10 Aug 2025 09:24:31 GMT
- Title: Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM
- Authors: Sihan Yang, Huitong Ji, Shaolin Lu, Jiayi Chen, Binxiao Xu, Ming Lu, Yuanxing Zhang, Wenhui Dong, Wentao Zhang,
- Abstract summary: We propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization. We develop a test-time reflection strategy, preventing the potential hallucination of the small VLM. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs.
- Score: 27.081774497698667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as chain-of-thought (CoT) reasoning. While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Motivated by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, in which the small VLM is responsible for generating personalized information, while the large model integrates this information to deliver accurate responses. To incorporate personalized information effectively, we develop a test-time reflection strategy that mitigates potential hallucinations from the small VLM. Since SLC only needs to train a meta personalized small VLM for the large VLMs, the overall process is training-efficient. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs, enabling broader real-world personalized applications. We conduct thorough experiments across various benchmarks and large VLMs to demonstrate the effectiveness of the proposed SLC framework. The code will be released at https://github.com/Hhankyangg/SLC.
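The abstract describes the SLC flow only at a high level; the Python sketch below illustrates one plausible reading of it: a meta-personalized small VLM produces concept-specific context, a test-time reflection step filters likely hallucinations, and the (possibly closed-source) large VLM answers using the verified context. The callables, prompts, and yes/no reflection heuristic are illustrative assumptions, not the authors' actual API or method.

```python
"""Minimal sketch of a Small-Large Collaboration (SLC)-style pipeline.
All function names, prompts, and the reflection check are assumptions
made for illustration; they are not taken from the paper or its code."""
from dataclasses import dataclass
from typing import Callable

# A "VLM" here is simply a callable taking (image_path, prompt) -> text.
VLM = Callable[[str, str], str]


@dataclass
class PersonalizedConcept:
    name: str          # e.g. a user-specific concept token such as "<my-dog>"
    description: str   # personalized information produced by the small VLM


def generate_personal_info(small_vlm: VLM, image: str, concept: str) -> PersonalizedConcept:
    """Small VLM (cheap to fine-tune) generates the personalized information."""
    desc = small_vlm(image, f"Describe whether and where {concept} appears in this image.")
    return PersonalizedConcept(name=concept, description=desc)


def reflect(small_vlm: VLM, image: str, info: PersonalizedConcept) -> bool:
    """Illustrative test-time reflection: ask the small VLM to verify its own
    claim and keep the information only if the check is affirmative."""
    verdict = small_vlm(
        image,
        f"Is this statement about {info.name} supported by the image? "
        f"Answer yes or no. Statement: {info.description}",
    )
    return verdict.strip().lower().startswith("yes")


def answer(large_vlm: VLM, small_vlm: VLM, image: str, concept: str, question: str) -> str:
    """Large VLM integrates the (verified) personalized information into its answer."""
    info = generate_personal_info(small_vlm, image, concept)
    context = info.description if reflect(small_vlm, image, info) else ""
    prompt = (
        f"Personalized context about {concept}: {context or 'none available'}\n"
        f"Question: {question}"
    )
    return large_vlm(image, prompt)
```

Because the large VLM is only queried at inference time through its normal prompt interface, a sketch like this works equally for open-source and API-only models, which is the training-efficiency argument made in the abstract.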
Related papers
- Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA [53.68989489261506]
Moxin 7B is introduced as a fully open-source Large Language Model (LLM). We develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese. Experiments show that our models achieve superior performance in various evaluations.
arXiv Detail & Related papers (2025-12-22T02:36:42Z) - Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration [35.429026246760635]
BeMyEyes is a modular framework for extending Large Language Models (LLMs) to multimodal reasoning. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models. Experiments show that our framework unlocks the multimodal reasoning capabilities of LLMs.
arXiv Detail & Related papers (2025-11-24T18:55:16Z) - GenRecal: Generation after Recalibration from Large to Small Vision-Language Models [63.27511432647797]
Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V.
arXiv Detail & Related papers (2025-06-18T17:59:49Z) - VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making [45.02997774119763]
Vision-language models (VLMs) extend large language models (LLMs) to multi-modal data. Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective.
arXiv Detail & Related papers (2025-05-06T04:51:57Z) - RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models [53.304699445700926]
We introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning.
arXiv Detail & Related papers (2024-10-17T09:10:26Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - Are Bigger Encoders Always Better in Vision Large Models? [21.797332686137203]
Multimodal large language models (MLLMs) have shown strong potential in real-world applications.
The scaling trend of vision language models (VLMs) under the current mainstream paradigm has not been extensively studied.
We conduct experiments on the pretraining stage of MLLMs using different encoder sizes and large language model (LLM) sizes.
arXiv Detail & Related papers (2024-08-01T15:05:42Z) - MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series [86.31735321970481]
We open-source MAP-Neo, a bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens.
Our MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs.
arXiv Detail & Related papers (2024-05-29T17:57:16Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning [12.628697648945298]
Reinforcement learning (RL) requires either manually specifying a reward function, or learning a reward model from a large amount of human feedback.
We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language.
arXiv Detail & Related papers (2023-10-19T17:17:06Z)
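The last related paper above proposes using a pretrained VLM as a zero-shot reward model, scoring observations against a natural-language task description. The sketch below shows one common way to realize that idea with an off-the-shelf CLIP model via Hugging Face Transformers; the specific model checkpoint and the cosine-similarity reward are assumptions for illustration and may differ from the paper's exact formulation.

```python
"""Illustrative VLM-as-reward-model sketch: reward an RL agent by the
similarity between the current observation and a text task description.
Model choice and reward formula are assumptions, not the paper's exact setup."""
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # any CLIP-style checkpoint works
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def vlm_reward(observation: Image.Image, task_description: str) -> float:
    """Reward = cosine similarity between the observation image and the task text."""
    inputs = processor(
        text=[task_description], images=observation,
        return_tensors="pt", padding=True,
    )
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())


# Example usage inside an RL loop (the rendered frame is a placeholder):
# reward = vlm_reward(frame, "a humanoid robot kneeling on the ground")
```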