Cloud-Device Collaborative Learning for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2312.16279v1
- Date: Tue, 26 Dec 2023 18:46:14 GMT
- Title: Cloud-Device Collaborative Learning for Multimodal Large Language Models
- Authors: Guanqun Wang, Jiaming Liu, Chenxuan Li, Junpeng Ma, Yuan Zhang, Xinyu
Wei, Kevin Zhang, Maurice Chong, Ray Zhang, Yijiang Liu, Shanghang Zhang
- Abstract summary: We introduce a Cloud-Device Collaborative Continual Adaptation framework to enhance the performance of compressed, device-deployed MLLMs.
Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment.
- Score: 24.65882336700547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The burgeoning field of Multimodal Large Language Models (MLLMs) has
exhibited remarkable performance in diverse tasks such as captioning,
commonsense reasoning, and visual scene understanding. However, the deployment
of these large-scale MLLMs on client devices is hindered by their extensive
model parameters, leading to a notable decline in generalization capabilities
when these models are compressed for device deployment. Addressing this
challenge, we introduce a Cloud-Device Collaborative Continual Adaptation
framework, designed to enhance the performance of compressed, device-deployed
MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs.
Our framework is structured into three key components: a device-to-cloud uplink
for efficient data transmission, cloud-based knowledge adaptation, and an
optimized cloud-to-device downlink for model deployment. In the uplink phase,
we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively
filter out-of-distribution tokens, thereby reducing transmission costs and
improving training efficiency. On the cloud side, we propose an Adapter-based
Knowledge Distillation (AKD) method to transfer refined knowledge from
large-scale to compressed, pocket-size MLLMs. Furthermore, we propose a Dynamic
Weight update Compression (DWC) strategy for the downlink, which adaptively
selects and quantizes updated weight parameters, enhancing transmission
efficiency and reducing the representational disparity between cloud and device
models. Extensive experiments on several multimodal benchmarks demonstrate the
superiority of our proposed framework over prior Knowledge Distillation and
device-cloud collaboration methods. Notably, we also validate the feasibility
of our approach in real-world experiments.
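The abstract does not spell out how the Uncertainty-guided Token Sampling (UTS) step works internally. The sketch below is one plausible reading, assuming the predictive entropy of the compressed device model serves as the uncertainty signal and only the most uncertain tokens are uploaded; the function name, keep ratio, and entropy criterion are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def uncertainty_guided_sampling(logits: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """Pick the most uncertain tokens from device-model predictions.

    logits: (num_tokens, vocab_size) raw outputs of the compressed device MLLM.
    Returns indices of the tokens with the highest predictive entropy; only
    these tokens (and their inputs) would be sent over the uplink.
    """
    probs = F.softmax(logits, dim=-1)
    # Predictive entropy as a simple uncertainty proxy (assumed, not from the paper).
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)  # (num_tokens,)
    k = max(1, int(keep_ratio * logits.shape[0]))
    return torch.topk(entropy, k).indices
```

In practice, the retained indices would be used to gather the corresponding inputs before transmission, trading uplink bandwidth against how much out-of-distribution data reaches the cloud.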
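For the cloud-side Adapter-based Knowledge Distillation (AKD), a common recipe is to insert lightweight bottleneck adapters into the compressed student and train only those adapters against the frozen large-scale teacher with a softened KL objective. The module and loss below sketch that recipe under these assumptions; the bottleneck width, temperature, and residual form are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted into the compressed (student) MLLM."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's behavior as the default.
        return x + self.up(F.relu(self.down(x)))

def akd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
             temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student token distributions."""
    t = temperature
    log_s = F.log_softmax(student_logits / t, dim=-1)
    p = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_s, p, reduction="batchmean") * (t * t)
```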
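The Dynamic Weight update Compression (DWC) downlink "adaptively selects and quantizes updated weight parameters." A minimal sketch of that idea, assuming magnitude-based selection of the weight delta followed by symmetric int8 quantization (the keep ratio and quantization scheme are illustrative choices, not the paper's):

```python
import torch

def compress_update(old_w: torch.Tensor, new_w: torch.Tensor, keep_ratio: float = 0.1):
    """Sparsify and int8-quantize a weight update for the cloud-to-device downlink."""
    delta = (new_w - old_w).flatten()
    k = max(1, int(keep_ratio * delta.numel()))
    idx = torch.topk(delta.abs(), k).indices    # keep only the largest-magnitude updates
    values = delta[idx]
    scale = values.abs().max() / 127.0 + 1e-12  # symmetric quantization scale
    q = torch.clamp((values / scale).round(), -127, 127).to(torch.int8)
    return idx, q, scale

def apply_update(old_w: torch.Tensor, idx: torch.Tensor, q: torch.Tensor,
                 scale: torch.Tensor) -> torch.Tensor:
    """Device side: dequantize the sparse update and apply it to the local weights."""
    flat = old_w.flatten().clone()
    flat[idx] += q.float() * scale
    return flat.view_as(old_w)
```

Only the indices, int8 values, and a single scale per tensor cross the downlink, which is how such a scheme reduces transmission cost relative to shipping full-precision weights.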
Related papers
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods in average accuracy by 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Contemporary Model Compression on Large Language Models Inference [7.307436175842646]
Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks.
The computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications.
This survey explores techniques in model compression that address these challenges by reducing the size and computational requirements of LLMs.
arXiv Detail & Related papers (2024-09-03T15:35:01Z)
- Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration [37.456185990843515]
We introduce a Universal On-Device Multi-modal Model Adaptation Framework.
The framework features the Fast Domain Adaptor (FDA) hosted in the cloud, providing tailored parameters for the Lightweight Multi-modal Model on devices.
Our contributions represent a pioneering solution for on-Device Multi-modal Model Adaptation (DMMA).
arXiv Detail & Related papers (2024-05-21T14:42:18Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- ECLM: Efficient Edge-Cloud Collaborative Learning with Continuous Environment Adaptation [47.35179593006409]
We propose ECLM, an edge-cloud collaborative learning framework for rapid model adaptation for dynamic edge environments.
We show that ECLM significantly improves model performance (e.g., 18.89% accuracy increase) and resource efficiency (e.g. 7.12x communication cost reduction) in adapting models to dynamic edge environments.
arXiv Detail & Related papers (2023-11-18T14:10:09Z)
- Device-Cloud Collaborative Learning for Recommendation [50.01289274123047]
We propose a novel MetaPatch learning approach on the device side to efficiently achieve "thousands of people with thousands of models" given a centralized cloud model.
With billions of updated personalized device models, we propose a "model-over-models" distillation algorithm, namely MoMoDistill, to update the centralized cloud model.
arXiv Detail & Related papers (2021-04-14T05:06:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.