FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning
- URL: http://arxiv.org/abs/2404.15182v1
- Date: Fri, 12 Apr 2024 00:36:43 GMT
- Title: FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning
- Authors: Duy Phuong Nguyen, J. Pablo Munoz, Ali Jannesari
- Abstract summary: Multimodal models that integrate vision and language into visual-language models (VLMs) have become pivotal for many applications.
This paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters to train VLMs.
Our approach accelerates training by up to 34.72 times and uses 2.47 times less memory than full fine-tuning.
- Score: 6.648544684097181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the rapidly evolving field of artificial intelligence, multimodal models, e.g., integrating vision and language into visual-language models (VLMs), have become pivotal for many applications, ranging from image captioning to multimodal search engines. Among these models, the Contrastive Language-Image Pre-training (CLIP) model has demonstrated remarkable performance in understanding and generating nuanced relationships between text and images. However, the conventional training of such models often requires centralized aggregation of vast datasets, posing significant privacy and data governance challenges. To address these concerns, this paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters, i.e., Low-Rank Adaptation (LoRA), to train VLMs. This methodology preserves data privacy by training models across decentralized data sources and ensures model adaptability and efficiency through LoRA's parameter-efficient fine-tuning. Our approach accelerates training by up to 34.72 times and uses 2.47 times less memory than full fine-tuning.
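The abstract names the two ingredients but not their interplay. As a hedged sketch of how they typically compose (our illustration, not the paper's code; `LoRALinear` and `fedavg_lora` are hypothetical names), clients train only low-rank adapters on a frozen backbone, and the server averages just those adapter tensors:

```python
# Minimal sketch, assuming a FedAvg-style aggregation over LoRA adapters
# on a frozen pretrained backbone (e.g., the linear layers of CLIP).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scaling * (B A) x; only A and B receive gradients
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

def fedavg_lora(client_states: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Average only the LoRA tensors communicated by each client (FedAvg)."""
    keys = [k for k in client_states[0] if "lora_" in k]
    return {k: torch.stack([s[k] for s in client_states]).mean(dim=0) for k in keys}
```

Because only the rank-r factors travel between clients and server, per-round communication per layer drops from in_features x out_features values to r x (in_features + out_features), which is where savings of the kind the abstract reports come from.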
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average of 0.77% accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- Promoting Data and Model Privacy in Federated Learning through Quantized LoRA [41.81020951061438]
We introduce a method that needs to distribute only a quantized version of the model's parameters during training.
We combine this quantization strategy with LoRA, a popular and parameter-efficient fine-tuning method, to significantly reduce communication costs in federated learning.
The proposed framework, named FedLPP, successfully ensures both data and model privacy in the federated learning context.
arXiv Detail & Related papers (2024-06-16T15:23:07Z)
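The FedLPP summary above pairs a quantized backbone with LoRA fine-tuning. As a rough illustration of the distribution step only (helper names are ours, and the paper's exact quantizer may differ), a symmetric per-tensor int8 round-trip:

```python
# Hedged sketch of quantized weight distribution: the server ships int8
# weights plus one scale per tensor; clients reconstruct an approximate
# backbone and train LoRA adapters on top of it.
import torch

def quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Symmetric per-tensor int8 quantization of a weight matrix."""
    scale = w.abs().max().item() / 127.0 + 1e-12  # guard against all-zero tensors
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: float) -> torch.Tensor:
    """Client-side reconstruction of the approximate full-precision weights."""
    return q.to(torch.float32) * scale
```

Shipping int8 weights cuts the downlink payload to roughly a quarter of float32, and because clients only ever see this low-precision approximation, the full-precision central model is never distributed.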
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
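Preference data of the kind STIC self-constructs is commonly optimized with a direct preference optimization (DPO) style objective; the sketch below is our hedged illustration under that assumption, with argument names of our choosing:

```python
# Minimal DPO-style loss over (preferred, dispreferred) description pairs.
# Each argument is the summed token log-probability of a description under
# the trained model (logp_*) or the frozen reference model (ref_logp_*).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Reward margin: how much more the trained model prefers the chosen
    # description, relative to the reference model's preference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```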
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
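Of the three UniDiff components, ITC is the standard CLIP-style contrastive objective, the same family of objective behind the CLIP backbone in the main paper. A minimal sketch, assuming batch-aligned image and text embeddings:

```python
# CLIP-style image-text contrastive (ITC) loss: matched pairs sit on the
# diagonal of the similarity matrix and are contrasted in both directions.
import torch
import torch.nn.functional as F

def itc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature          # (N, N) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```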
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning that produces effective models, outperforming bigger and more expensive ones often trained on orders-of-magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and train on bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- Dynamic Sampling and Selective Masking for Communication-Efficient Federated Learning [11.511755449420253]
Federated learning (FL) is a novel machine learning setting that enables on-device intelligence via decentralized training and federated optimization.
This paper introduces two approaches for improving communication efficiency by dynamic sampling and top-$k$ selective masking.
arXiv Detail & Related papers (2020-03-21T08:20:04Z)
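As a hedged sketch of the top-$k$ selective masking idea from the last entry (function names are ours; the paper's exact rule may differ), each client uploads only the $k$ largest-magnitude entries of its local update:

```python
# Top-k selective masking for communication-efficient FL: clients send
# (index, value) pairs; the server scatters them back into the global model.
import torch

def topk_sparsify(update: torch.Tensor, k: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep only the k largest-magnitude entries of a flattened update."""
    flat = update.flatten()
    idx = flat.abs().topk(k).indices
    return idx, flat[idx]

def apply_sparse_update(param: torch.Tensor, idx: torch.Tensor,
                        vals: torch.Tensor) -> None:
    """Scatter the sparse values into the (contiguous) parameter tensor."""
    param.data.view(-1).index_add_(0, idx, vals)
```

The per-round upload shrinks from numel(update) floats to roughly 2k values (indices plus magnitudes), at the cost of a biased, sparsified update.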