Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
- URL: http://arxiv.org/abs/2502.06355v1
- Date: Mon, 10 Feb 2025 11:10:41 GMT
- Title: Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
- Authors: Timo Fudala, Vasileios Tsouvalas, Nirvana Meratnia
- Abstract summary: Split Learning partitions models at a designated cut-layer to offload compute-intensive operations to the server.
We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner.
MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs.
- Score: 1.297210402524609
- License:
- Abstract: Multimodal transformers integrate diverse data types like images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their high parameterization limits deployment on resource-constrained edge devices. Split Learning (SL), which partitions models at a designated cut-layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner, while eliminating label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7 multimodal datasets demonstrates that MPSL matches or outperforms Federated Learning, reduces client-side computations by 250x, and achieves superior scalability in communication cost with model growth. Through extensive analysis, we highlight task suitability, trade-offs, and scenarios where MPSL excels, inspiring further exploration.
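To make the cut-layer idea concrete, below is a minimal, single-process sketch of parallel split learning in the spirit of MPSL: each client runs a lightweight tokenizer locally and sends token embeddings across the cut to one shared, modality-agnostic server-side encoder, so no per-client sub-models are kept on the server. All class names, dimensions, and the toy data are illustrative assumptions, not the paper's implementation.

```python
# A minimal, single-process sketch of parallel split learning with a cut at the
# tokenizer boundary; class names, dimensions, and the toy data are assumptions.
import torch
import torch.nn as nn

class ClientTokenizer(nn.Module):
    """Lightweight client-side module: raw features -> token embeddings."""
    def __init__(self, in_dim=128, d_model=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)  # stand-in for a modality tokenizer

    def forward(self, x):
        return self.proj(x)

class ServerEncoder(nn.Module):
    """Shared, modality-agnostic encoder hosted on the server."""
    def __init__(self, d_model=256, n_layers=2, n_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):
        return self.head(self.encoder(tokens).mean(dim=1))

clients = [ClientTokenizer() for _ in range(3)]      # clients train in parallel
server = ServerEncoder()
opt = torch.optim.Adam(list(server.parameters())
                       + [p for c in clients for p in c.parameters()], lr=1e-4)

# One illustrative round: each client tokenizes its data locally, the server
# encodes the "smashed" activations, and gradients flow back through the cut.
opt.zero_grad()
for client in clients:
    x = torch.randn(8, 16, 128)               # (batch, tokens, features) on the client
    y = torch.randint(0, 10, (8,))             # client-local labels
    logits = server(client(x))                 # activations cross the cut-layer
    nn.functional.cross_entropy(logits, y).backward()
opt.step()
```

In an actual deployment the tokenizer outputs and returned gradients would cross the network; here everything is simulated in one process for brevity.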
Related papers
- Adaptive Prototype Knowledge Transfer for Federated Learning with Mixed Modalities and Heterogeneous Tasks [12.67996108615162]
We propose an Adaptive prototype-based Multimodal Federated Learning (AproMFL) framework for mixed modalities and heterogeneous tasks.
Our AproMFL transfers knowledge through adaptively-constructed prototypes without a prior public dataset.
Clients adaptively select prototype construction methods in line with their tasks; the server converts client prototypes into unified multimodal prototypes and aggregates them to form global prototypes.
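As a rough illustration of the prototype exchange just described, the sketch below computes per-class mean embeddings on each client and averages them on the server into global prototypes. The adaptive construction choices and the conversion into unified multimodal prototypes are not modeled; function names and dimensions are assumptions.

```python
# Toy prototype-style knowledge transfer: clients send per-class mean
# embeddings, the server averages them into global prototypes.
import torch

def client_prototypes(embeddings, labels, num_classes):
    """Per-class mean embedding, computed locally on one client."""
    protos = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = embeddings[mask].mean(dim=0)
    return protos

def server_aggregate(client_protos, num_classes):
    """Server-side averaging of client prototypes into global prototypes."""
    global_protos = {}
    for c in range(num_classes):
        contribs = [p[c] for p in client_protos if c in p]
        if contribs:
            global_protos[c] = torch.stack(contribs).mean(dim=0)
    return global_protos

# Two clients with locally computed embeddings for 5 classes
protos_a = client_prototypes(torch.randn(32, 64), torch.randint(0, 5, (32,)), 5)
protos_b = client_prototypes(torch.randn(32, 64), torch.randint(0, 5, (32,)), 5)
global_protos = server_aggregate([protos_a, protos_b], num_classes=5)
```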
arXiv Detail & Related papers (2025-02-06T07:28:05Z)
- Pilot: Building the Federated Multimodal Instruction Tuning Framework [79.56362403673354]
Our framework integrates two stages of "adapter on adapter" into the connector between the vision encoder and the LLM.
In stage 1, we extract task-specific features and client-specific features from visual information.
In stage 2, we build the cross-task Mixture-of-Adapters (CT-MoA) module to perform cross-task interaction.
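A minimal sketch of the "adapter on adapter" idea in the connector, assuming simple bottleneck adapters: a task-level adapter is stacked with a client-level adapter before vision features are projected into the LLM input space. The CT-MoA routing of stage 2 is omitted, and all names and dimensions are assumptions.

```python
# Illustrative "adapter on adapter" connector between a vision encoder and an LLM.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class Connector(nn.Module):
    """Maps vision features to the LLM input space with stacked adapters."""
    def __init__(self, vision_dim=768, llm_dim=1024):
        super().__init__()
        self.task_adapter = Adapter(vision_dim)    # shared per task across clients
        self.client_adapter = Adapter(vision_dim)  # kept local to each client
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats):
        h = self.task_adapter(vision_feats)
        h = self.client_adapter(h)
        return self.proj(h)

connector = Connector()
tokens_for_llm = connector(torch.randn(2, 196, 768))  # patch features -> LLM tokens
```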
arXiv Detail & Related papers (2025-01-23T07:49:24Z)
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
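The sketch below illustrates, under simplifying assumptions, what a multi-directional adapter between two frozen encoders might look like: at a given encoding stage, each modality stream receives a residual message computed from the other stream. Layer counts, dimensions, and the message form are illustrative, not StitchFusion's actual design.

```python
# Rough sketch of a multi-directional adapter exchanging information between
# two modality encoders during encoding; shapes and dims are assumptions.
import torch
import torch.nn as nn

class MultiAdapter(nn.Module):
    """Exchanges features between modality A and modality B at one encoder stage."""
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.a_to_b = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.b_to_a = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))

    def forward(self, feat_a, feat_b):
        # Each stream receives a residual message from the other stream.
        return feat_a + self.b_to_a(feat_b), feat_b + self.a_to_b(feat_a)

# Stand-ins for one stage of two large pre-trained encoders (frozen in practice)
stage_a = nn.Linear(256, 256)
stage_b = nn.Linear(256, 256)
adapter = MultiAdapter()

x_a, x_b = torch.randn(4, 100, 256), torch.randn(4, 100, 256)
x_a, x_b = stage_a(x_a), stage_b(x_b)   # per-modality encoding
x_a, x_b = adapter(x_a, x_b)            # cross-modal transfer during encoding
```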
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- MP-SL: Multihop Parallel Split Learning [2.7716102039510564]
Multihop Parallel SL (MP-SL) is a modular Machine Learning as a Service (MLaaS) framework designed to facilitate the involvement of resource-constrained devices.
MP-SL supports multihop Parallel SL-based training. This involves splitting the model into multiple parts and utilizing multiple compute nodes in a pipelined manner.
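To illustrate the multihop, pipelined idea, the following toy loop splits a small model into three parts ("hops") and advances micro-batches through them so that different micro-batches occupy different hops at the same time. Node placement is simulated within one process; the split points, hop count, and scheduling are assumptions for the sketch.

```python
# Toy pipelined multihop split: three model parts, micro-batches in flight.
import torch
import torch.nn as nn

parts = [                                           # model split across three hops
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),    # hop 0: client
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),    # hop 1: compute node A
    nn.Linear(64, 10),                              # hop 2: compute node B
]

micro_batches = list(torch.randn(16, 32).chunk(4))  # pipeline 4 micro-batches
in_flight = [None] * len(parts)                     # activation currently held at each hop
outputs, next_mb = [], 0

for _ in range(len(micro_batches) + len(parts)):
    # Advance later hops first so each hop's slot is free for its predecessor.
    for hop in reversed(range(len(parts))):
        if in_flight[hop] is not None:
            act = parts[hop](in_flight[hop])
            in_flight[hop] = None
            if hop + 1 < len(parts):
                in_flight[hop + 1] = act            # forward activation to the next hop
            else:
                outputs.append(act)                 # final hop produces predictions
    if next_mb < len(micro_batches):                # feed the next micro-batch into hop 0
        in_flight[0] = micro_batches[next_mb]
        next_mb += 1

assert len(outputs) == len(micro_batches)
```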
arXiv Detail & Related papers (2024-01-31T22:09:40Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of both the deformable CNN and the query-based Transformer with shared gating for multi-task learning of dense prediction.
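As a toy illustration of gated fusion between a convolutional branch and an attention branch for multiple dense-prediction tasks (an ordinary convolution stands in for the deformable mixer, and the gating form, shapes, and task heads are assumptions):

```python
# Toy per-pixel gated fusion of two feature branches for multi-task dense prediction.
import torch
import torch.nn as nn

class GatedTaskHead(nn.Module):
    def __init__(self, channels=64, out_channels=1):
        super().__init__()
        self.gate = nn.Conv2d(channels * 2, channels, kernel_size=1)
        self.out = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, conv_feat, attn_feat):
        g = torch.sigmoid(self.gate(torch.cat([conv_feat, attn_feat], dim=1)))
        return self.out(g * conv_feat + (1 - g) * attn_feat)  # gated per-pixel mix

conv_branch = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stand-in for the deformable mixer
attn_branch = nn.Conv2d(3, 64, kernel_size=1)             # stand-in for transformer features
heads = {"depth": GatedTaskHead(), "seg": GatedTaskHead(out_channels=21)}

x = torch.randn(2, 3, 64, 64)
cf, af = conv_branch(x), attn_branch(x)
preds = {task: head(cf, af) for task, head in heads.items()}
```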
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- PFSL: Personalized & Fair Split Learning with Data & Label Privacy for thin clients [0.5144809478361603]
PFSL is a new framework of distributed split learning where a large number of thin clients perform transfer learning in parallel.
We implement a lightweight step of personalization of client models to provide high performance for their respective data distributions.
Our accuracy far exceeds that of current SL algorithms and is very close to that of centralized learning on several real-life benchmarks.
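A minimal sketch of what a lightweight personalization step could look like on a thin client, assuming the collaboratively trained part is frozen and only a small local head is fine-tuned on client data; the layer sizes and the choice of which parameters stay local are assumptions.

```python
# Illustrative client-side personalization: fine-tune only a small local head.
import torch
import torch.nn as nn

shared_backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # trained collaboratively via SL
client_head = nn.Linear(64, 10)                                # personal, stays on the client

for p in shared_backbone.parameters():
    p.requires_grad = False                                    # personalize the head only

opt = torch.optim.SGD(client_head.parameters(), lr=1e-2)
for _ in range(5):                                             # a few local steps
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))    # client-local data
    loss = nn.functional.cross_entropy(client_head(shared_backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```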
arXiv Detail & Related papers (2023-03-19T10:38:29Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
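The sketch below shows the general shape of a mostly unified encoder: modality-specific input embeddings feed a single transformer whose parameters are shared between vision and language. Vocabulary size, dimensions, and pooling are assumptions, not MS-CLIP's configuration.

```python
# Sketch of a mostly unified encoder: shared transformer, modality-specific inputs.
import torch
import torch.nn as nn

d_model = 256
patch_embed = nn.Linear(768, d_model)        # vision-specific input projection
token_embed = nn.Embedding(30000, d_model)   # text-specific input embedding
shared_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
shared_encoder = nn.TransformerEncoder(shared_layer, num_layers=4)  # shared across modalities

def encode_image(patches):                   # patches: (batch, n_patches, 768)
    return shared_encoder(patch_embed(patches)).mean(dim=1)

def encode_text(token_ids):                  # token_ids: (batch, seq_len)
    return shared_encoder(token_embed(token_ids)).mean(dim=1)

img_emb = encode_image(torch.randn(2, 49, 768))
txt_emb = encode_text(torch.randint(0, 30000, (2, 16)))
similarity = nn.functional.cosine_similarity(img_emb, txt_emb)  # contrastive pairing signal
```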
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- Federated Multi-Target Domain Adaptation [99.93375364579484]
Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy.
We consider a more practical scenario where the distributed client data is unlabeled, and a centralized labeled dataset is available on the server.
We propose an effective DualAdapt method to address the new challenges.
arXiv Detail & Related papers (2021-08-17T17:53:05Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
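One way to read the parameter reduction, sketched under assumptions below, is weight sharing: the same transformer block is reused across depth and across the audio and visual streams, so the parameter count stays that of a single layer. Whether this matches the paper's exact sharing scheme is not confirmed by the summary.

```python
# Sketch of parameter reduction via weight sharing across depth and modalities.
import torch
import torch.nn as nn

d_model = 256
shared_block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

def encode(tokens, depth=6):
    # Applying the same block repeatedly keeps the parameter count of a
    # single layer while retaining multi-layer compute.
    h = tokens
    for _ in range(depth):
        h = shared_block(h)
    return h

audio_tokens = torch.randn(2, 50, d_model)   # audio stream
video_tokens = torch.randn(2, 98, d_model)   # visual stream
audio_repr = encode(audio_tokens)            # both streams reuse the same weights
video_repr = encode(video_tokens)
```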
arXiv Detail & Related papers (2020-12-08T00:16:13Z)