DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
- URL: http://arxiv.org/abs/2509.22793v1
- Date: Fri, 26 Sep 2025 18:01:15 GMT
- Title: DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
- Authors: Komal Kumar, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Ivan Laptev, Hisham Cholakkal
- Abstract summary: DEFT, Decompositional Efficient Fine-Tuning, adapts a pre-trained weight matrix by decomposing its update into two components. We conduct experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework.
- Score: 103.18486625853099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often struggles to balance two goals: aligning with the target distribution (learning a novel concept from a limited number of images for personalization) and retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. One trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available at https://github.com/MAXNORM8650/DEFT (DEFTBase).
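The two-component update described in the abstract can be illustrated with a small NumPy sketch. This is only one plausible reading of the abstract, not the authors' implementation: it assumes the adapted weight takes the form W' = (I - QQ^T) W + Q B, where Q is an orthonormal basis for the span of a trainable matrix A and B is the second trainable matrix. The function name `deft_style_update`, the symbols A and B, and the use of a QR factorization to obtain Q are all assumptions made for illustration.

```python
import numpy as np


def deft_style_update(W, A, B):
    """Hypothetical DEFT-style update (an assumption based on the abstract).

    W: (d, k) frozen pre-trained weight.
    A: (d, r) trainable matrix whose columns span the low-rank subspace.
    B: (r, k) trainable matrix giving the new content inside that subspace.
    """
    # Orthonormal basis for span(A); reduced QR returns Q with shape (d, r).
    Q, _ = np.linalg.qr(A)
    # Component 1: project W onto the orthogonal complement of span(A).
    W_complement = W - Q @ (Q.T @ W)
    # Component 2: low-rank update that lives inside span(A).
    return W_complement + Q @ B


rng = np.random.default_rng(0)
d, k, r = 8, 6, 2
W = rng.standard_normal((d, k))
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, k))

W_new = deft_style_update(W, A, B)
Q, _ = np.linalg.qr(A)

# Inside the subspace, the adapted weight is fully determined by B.
print(np.allclose(Q.T @ W_new, B))
# Outside the subspace, the pre-trained weight is left untouched.
print(np.allclose(W_new - Q @ (Q.T @ W_new), W - Q @ (Q.T @ W)))
```

Under this reading, only A and B (r(d + k) values) would be trained while W stays frozen, and choosing B = Q^T W recovers the original weight exactly.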
Related papers
- From Editor to Dense Geometry Estimator [77.21804448599009]
We introduce FE2E, a framework that adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. FE2E achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× more data.
arXiv Detail & Related papers (2025-09-04T15:58:50Z) - Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z) - DreamOmni: Unified Image Generation and Editing [51.45871494724542]
We introduce DreamOmni, a unified model for image generation and editing. For training, DreamOmni jointly trains T2I generation and downstream tasks. This collaboration significantly boosts editing performance.
arXiv Detail & Related papers (2024-12-22T17:17:28Z) - Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging [33.23758947497205]
Advanced embedding models are typically developed using large-scale multi-task data and joint training across multiple tasks.
To overcome these challenges, we explore model merging, a technique that combines independently trained models to mitigate gradient conflicts and balance data distribution.
We introduce a novel method, Self Positioning, which efficiently searches for optimal model combinations within the space of task vectors using gradient descent.
arXiv Detail & Related papers (2024-10-19T08:39:21Z) - PanAdapter: Two-Stage Fine-Tuning with Spatial-Spectral Priors Injecting for Pansharpening [8.916207546866048]
We propose an efficient fine-tuning method, namely PanAdapter, to alleviate the issue of small-scale datasets in pansharpening tasks.
We fine-tune the pre-trained CNN model and extract task-specific priors at two scales with the proposed Local Prior Extraction (LPE) module.
We demonstrate that our approach can benefit from pre-trained image restoration models and achieve state-of-the-art performance in several benchmark pansharpening datasets.
arXiv Detail & Related papers (2024-09-11T03:13:08Z) - Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts [20.202031878825153]
We propose a novel dynamic data mixture for MoE instruction tuning.
Inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets.
Results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries.
arXiv Detail & Related papers (2024-06-17T06:47:03Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in the modeling conundrum arising from distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model the above latent correspondence.
A bilevel learning framework is constructed to endow the scene-irrelevant generality of the encoder towards diverse scenes.
arXiv Detail & Related papers (2022-10-07T12:50:34Z) - Sample-Efficient Personalization: Modeling User Parameters as Low Rank Plus Sparse Components [30.32486162748558]
Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation systems.
We propose a novel meta-learning style approach that models network weights as a sum of low-rank and sparse components.
We show that AMHT-LRS solves the problem efficiently with nearly optimal sample complexity.
arXiv Detail & Related papers (2022-06-17T10:17:20Z) - FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification [47.24770508263431]
We develop FiLM Transfer (FiT) which fulfills requirements in the image classification setting.
FiT uses an automatically configured Naive Bayes classifier on top of a fixed backbone that has been pretrained on large image datasets.
We show that FiT achieves better classification accuracy than the state-of-the-art Big Transfer (BiT) algorithm at low-shot and on the challenging VTAB-1k benchmark.
arXiv Detail & Related papers (2022-06-17T10:17:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.