Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
- URL: http://arxiv.org/abs/2502.06355v1
- Date: Mon, 10 Feb 2025 11:10:41 GMT
- Title: Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
- Authors: Timo Fudala, Vasileios Tsouvalas, Nirvana Meratnia
- Abstract summary: Split Learning partitions models at a designated cut-layer to offload compute-intensive operations to the server.
We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner.
MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs.
- Score: 1.297210402524609
- License:
- Abstract: Multimodal transformers integrate diverse data types like images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their high parameterization limits deployment on resource-constrained edge devices. Split Learning (SL), which partitions models at a designated cut-layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner, while eliminating label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7 multimodal datasets demonstrates that MPSL matches or outperforms Federated Learning, reduces client-side computations by 250x, and achieves superior scalability in communication cost with model growth. Through extensive analysis, we highlight task suitability, trade-offs, and scenarios where MPSL excels, inspiring further exploration.
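To make the cut-layer idea concrete, below is a minimal, single-process sketch of parallel split learning in the spirit of MPSL: each client runs a lightweight tokenizer locally and sends token embeddings across the cut to one shared, modality-agnostic server-side encoder, so no per-client sub-models are kept on the server. All class names, dimensions, and the toy data are illustrative assumptions, not the paper's implementation.

```python
# A minimal, single-process sketch of parallel split learning with a cut at the
# tokenizer boundary; class names, dimensions, and the toy data are assumptions.
import torch
import torch.nn as nn

class ClientTokenizer(nn.Module):
    """Lightweight client-side module: raw features -> token embeddings."""
    def __init__(self, in_dim=128, d_model=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)  # stand-in for a modality tokenizer

    def forward(self, x):
        return self.proj(x)

class ServerEncoder(nn.Module):
    """Shared, modality-agnostic encoder hosted on the server."""
    def __init__(self, d_model=256, n_layers=2, n_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):
        return self.head(self.encoder(tokens).mean(dim=1))

clients = [ClientTokenizer() for _ in range(3)]      # clients train in parallel
server = ServerEncoder()
opt = torch.optim.Adam(list(server.parameters())
                       + [p for c in clients for p in c.parameters()], lr=1e-4)

# One illustrative round: each client tokenizes its data locally, the server
# encodes the "smashed" activations, and gradients flow back through the cut.
opt.zero_grad()
for client in clients:
    x = torch.randn(8, 16, 128)               # (batch, tokens, features) on the client
    y = torch.randint(0, 10, (8,))             # client-local labels
    logits = server(client(x))                 # activations cross the cut-layer
    nn.functional.cross_entropy(logits, y).backward()
opt.step()
```

In an actual deployment the tokenizer outputs and returned gradients would cross the network; here everything is simulated in one process for brevity.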
Related papers
- Adaptive Prototype Knowledge Transfer for Federated Learning with Mixed Modalities and Heterogeneous Tasks [12.67996108615162]
We propose an Adaptive prototype-based Multimodal Federated Learning (AproMFL) framework for mixed modalities and heterogeneous tasks.
Our AproMFL transfers knowledge through adaptively-constructed prototypes without a prior public dataset.
Clients adaptively select prototype construction methods in line with their tasks; the server converts client prototypes into unified multimodal prototypes and aggregates them to form global prototypes.
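As a rough illustration of the prototype exchange just described, the sketch below computes per-class mean embeddings on each client and averages them on the server into global prototypes. The adaptive construction choices and the conversion into unified multimodal prototypes are not modeled; function names and dimensions are assumptions.

```python
# Toy prototype-style knowledge transfer: clients send per-class mean
# embeddings, the server averages them into global prototypes.
import torch

def client_prototypes(embeddings, labels, num_classes):
    """Per-class mean embedding, computed locally on one client."""
    protos = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = embeddings[mask].mean(dim=0)
    return protos

def server_aggregate(client_protos, num_classes):
    """Server-side averaging of client prototypes into global prototypes."""
    global_protos = {}
    for c in range(num_classes):
        contribs = [p[c] for p in client_protos if c in p]
        if contribs:
            global_protos[c] = torch.stack(contribs).mean(dim=0)
    return global_protos

# Two clients with locally computed embeddings for 5 classes
protos_a = client_prototypes(torch.randn(32, 64), torch.randint(0, 5, (32,)), 5)
protos_b = client_prototypes(torch.randn(32, 64), torch.randint(0, 5, (32,)), 5)
global_protos = server_aggregate([protos_a, protos_b], num_classes=5)
```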
arXiv Detail & Related papers (2025-02-06T07:28:05Z)
- Pilot: Building the Federated Multimodal Instruction Tuning Framework [79.56362403673354]
Our framework integrates two stages of "adapter on adapter" into the connector between the vision encoder and the LLM.
In stage 1, we extract task-specific features and client-specific features from visual information.
In stage 2, we build the cross-task Mixture-of-Adapters (CT-MoA) module to perform cross-task interaction.
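A minimal sketch of the "adapter on adapter" idea in the connector, assuming simple bottleneck adapters: a task-level adapter is stacked with a client-level adapter before vision features are projected into the LLM input space. The CT-MoA routing of stage 2 is omitted, and all names and dimensions are assumptions.

```python
# Illustrative "adapter on adapter" connector between a vision encoder and an LLM.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class Connector(nn.Module):
    """Maps vision features to the LLM input space with stacked adapters."""
    def __init__(self, vision_dim=768, llm_dim=1024):
        super().__init__()
        self.task_adapter = Adapter(vision_dim)    # shared per task across clients
        self.client_adapter = Adapter(vision_dim)  # kept local to each client
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats):
        h = self.task_adapter(vision_feats)
        h = self.client_adapter(h)
        return self.proj(h)

connector = Connector()
tokens_for_llm = connector(torch.randn(2, 196, 768))  # patch features -> LLM tokens
```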
arXiv Detail & Related papers (2025-01-23T07:49:24Z)
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
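The sketch below illustrates, under simplifying assumptions, what a multi-directional adapter between two frozen encoders might look like: at a given encoding stage, each modality stream receives a residual message computed from the other stream. Layer counts, dimensions, and the message form are illustrative, not StitchFusion's actual design.

```python
# Rough sketch of a multi-directional adapter exchanging information between
# two modality encoders during encoding; shapes and dims are assumptions.
import torch
import torch.nn as nn

class MultiAdapter(nn.Module):
    """Exchanges features between modality A and modality B at one encoder stage."""
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.a_to_b = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.b_to_a = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))

    def forward(self, feat_a, feat_b):
        # Each stream receives a residual message from the other stream.
        return feat_a + self.b_to_a(feat_b), feat_b + self.a_to_b(feat_a)

# Stand-ins for one stage of two large pre-trained encoders (frozen in practice)
stage_a = nn.Linear(256, 256)
stage_b = nn.Linear(256, 256)
adapter = MultiAdapter()

x_a, x_b = torch.randn(4, 100, 256), torch.randn(4, 100, 256)
x_a, x_b = stage_a(x_a), stage_b(x_b)   # per-modality encoding
x_a, x_b = adapter(x_a, x_b)            # cross-modal transfer during encoding
```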
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- MP-SL: Multihop Parallel Split Learning [2.7716102039510564]
Multihop Parallel SL (MP-SL) is a modular Machine Learning as a Service (MLaaS) framework designed to facilitate the involvement of resource-constrained devices.
MP-SL supports multihop Parallel SL-based training. This involves splitting the model into multiple parts and utilizing multiple compute nodes in a pipelined manner.
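To illustrate the multihop, pipelined idea, the following toy loop splits a small model into three parts ("hops") and advances micro-batches through them so that different micro-batches occupy different hops at the same time. Node placement is simulated within one process; the split points, hop count, and scheduling are assumptions for the sketch.

```python
# Toy pipelined multihop split: three model parts, micro-batches in flight.
import torch
import torch.nn as nn

parts = [                                           # model split across three hops
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),    # hop 0: client
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),    # hop 1: compute node A
    nn.Linear(64, 10),                              # hop 2: compute node B
]

micro_batches = list(torch.randn(16, 32).chunk(4))  # pipeline 4 micro-batches
in_flight = [None] * len(parts)                     # activation currently held at each hop
outputs, next_mb = [], 0

for _ in range(len(micro_batches) + len(parts)):
    # Advance later hops first so each hop's slot is free for its predecessor.
    for hop in reversed(range(len(parts))):
        if in_flight[hop] is not None:
            act = parts[hop](in_flight[hop])
            in_flight[hop] = None
            if hop + 1 < len(parts):
                in_flight[hop + 1] = act            # forward activation to the next hop
            else:
                outputs.append(act)                 # final hop produces predictions
    if next_mb < len(micro_batches):                # feed the next micro-batch into hop 0
        in_flight[0] = micro_batches[next_mb]
        next_mb += 1

assert len(outputs) == len(micro_batches)
```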
arXiv Detail & Related papers (2024-01-31T22:09:40Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of both the deformable CNN and the query-based Transformer with shared gating for multi-task learning of dense prediction.
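As a toy illustration of gated fusion between a convolutional branch and an attention branch for multiple dense-prediction tasks (an ordinary convolution stands in for the deformable mixer, and the gating form, shapes, and task heads are assumptions):

```python
# Toy per-pixel gated fusion of two feature branches for multi-task dense prediction.
import torch
import torch.nn as nn

class GatedTaskHead(nn.Module):
    def __init__(self, channels=64, out_channels=1):
        super().__init__()
        self.gate = nn.Conv2d(channels * 2, channels, kernel_size=1)
        self.out = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, conv_feat, attn_feat):
        g = torch.sigmoid(self.gate(torch.cat([conv_feat, attn_feat], dim=1)))
        return self.out(g * conv_feat + (1 - g) * attn_feat)  # gated per-pixel mix

conv_branch = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stand-in for the deformable mixer
attn_branch = nn.Conv2d(3, 64, kernel_size=1)             # stand-in for transformer features
heads = {"depth": GatedTaskHead(), "seg": GatedTaskHead(out_channels=21)}

x = torch.randn(2, 3, 64, 64)
cf, af = conv_branch(x), attn_branch(x)
preds = {task: head(cf, af) for task, head in heads.items()}
```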
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- PFSL: Personalized & Fair Split Learning with Data & Label Privacy for thin clients [0.5144809478361603]
PFSL is a new framework of distributed split learning where a large number of thin clients perform transfer learning in parallel.
We implement a lightweight step of personalization of client models to provide high performance for their respective data distributions.
Our accuracy far exceeds that of current SL algorithms and is very close to that of centralized learning on several real-life benchmarks.
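A minimal sketch of what a lightweight personalization step could look like on a thin client, assuming the collaboratively trained part is frozen and only a small local head is fine-tuned on client data; the layer sizes and the choice of which parameters stay local are assumptions.

```python
# Illustrative client-side personalization: fine-tune only a small local head.
import torch
import torch.nn as nn

shared_backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # trained collaboratively via SL
client_head = nn.Linear(64, 10)                                # personal, stays on the client

for p in shared_backbone.parameters():
    p.requires_grad = False                                    # personalize the head only

opt = torch.optim.SGD(client_head.parameters(), lr=1e-2)
for _ in range(5):                                             # a few local steps
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))    # client-local data
    loss = nn.functional.cross_entropy(client_head(shared_backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```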
arXiv Detail & Related papers (2023-03-19T10:38:29Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
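The sketch below shows the general shape of a mostly unified encoder: modality-specific input embeddings feed a single transformer whose parameters are shared between vision and language. Vocabulary size, dimensions, and pooling are assumptions, not MS-CLIP's configuration.

```python
# Sketch of a mostly unified encoder: shared transformer, modality-specific inputs.
import torch
import torch.nn as nn

d_model = 256
patch_embed = nn.Linear(768, d_model)        # vision-specific input projection
token_embed = nn.Embedding(30000, d_model)   # text-specific input embedding
shared_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
shared_encoder = nn.TransformerEncoder(shared_layer, num_layers=4)  # shared across modalities

def encode_image(patches):                   # patches: (batch, n_patches, 768)
    return shared_encoder(patch_embed(patches)).mean(dim=1)

def encode_text(token_ids):                  # token_ids: (batch, seq_len)
    return shared_encoder(token_embed(token_ids)).mean(dim=1)

img_emb = encode_image(torch.randn(2, 49, 768))
txt_emb = encode_text(torch.randint(0, 30000, (2, 16)))
similarity = nn.functional.cosine_similarity(img_emb, txt_emb)  # contrastive pairing signal
```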
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- Federated Multi-Target Domain Adaptation [99.93375364579484]
Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy.
We consider a more practical scenario where the distributed client data is unlabeled, and a centralized labeled dataset is available on the server.
We propose an effective DualAdapt method to address the new challenges.
arXiv Detail & Related papers (2021-08-17T17:53:05Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
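One way to read the parameter reduction, sketched under assumptions below, is weight sharing: the same transformer block is reused across depth and across the audio and visual streams, so the parameter count stays that of a single layer. Whether this matches the paper's exact sharing scheme is not confirmed by the summary.

```python
# Sketch of parameter reduction via weight sharing across depth and modalities.
import torch
import torch.nn as nn

d_model = 256
shared_block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

def encode(tokens, depth=6):
    # Applying the same block repeatedly keeps the parameter count of a
    # single layer while retaining multi-layer compute.
    h = tokens
    for _ in range(depth):
        h = shared_block(h)
    return h

audio_tokens = torch.randn(2, 50, d_model)   # audio stream
video_tokens = torch.randn(2, 98, d_model)   # visual stream
audio_repr = encode(audio_tokens)            # both streams reuse the same weights
video_repr = encode(video_tokens)
```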
arXiv Detail & Related papers (2020-12-08T00:16:13Z)