Unsupervised Domain Adaption Harnessing Vision-Language Pre-training
- URL: http://arxiv.org/abs/2408.02192v1
- Date: Mon, 5 Aug 2024 02:37:59 GMT
- Title: Unsupervised Domain Adaption Harnessing Vision-Language Pre-training
- Authors: Wenlve Zhou, Zhiheng Zhou
- Abstract summary: This paper focuses on harnessing the power of Vision-Language Pre-training (VLP) models in Unsupervised Domain Adaptation (UDA).
We propose a novel method called Cross-Modal Knowledge Distillation (CMKD).
Our proposed method outperforms existing techniques on standard benchmarks.
- Score: 4.327763441385371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses two vital challenges in Unsupervised Domain Adaptation (UDA) with a focus on harnessing the power of Vision-Language Pre-training (VLP) models. Firstly, UDA has primarily relied on ImageNet pre-trained models, and the potential of VLP models in UDA remains largely unexplored; their rich representations hold significant promise for enhancing UDA tasks. To exploit this, we propose a novel method called Cross-Modal Knowledge Distillation (CMKD), which leverages VLP models as teacher models to guide the learning process in the target domain, resulting in state-of-the-art performance. Secondly, current UDA paradigms involve training a separate model for each task, leading to significant storage overhead and impractical model deployment as the number of transfer tasks grows. To overcome this challenge, we introduce Residual Sparse Training (RST), a technique that exploits the benefits conferred by VLP's extensive pre-training and requires adjusting only a minimal fraction (approximately 0.1%~0.5%) of VLP model parameters to achieve performance comparable to fine-tuning. Combining CMKD and RST, we present a comprehensive solution that effectively leverages VLP models for UDA tasks while reducing the storage overhead of model deployment. Furthermore, CMKD can serve as a baseline in conjunction with other methods such as FixMatch, further enhancing UDA performance. Our proposed method outperforms existing techniques on standard benchmarks. Our code will be available at: https://github.com/Wenlve-Zhou/VLP-UDA.
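For readers who want a concrete picture of the two ideas, the sketch below illustrates them with generic PyTorch code: a cross-modal distillation loss driven by a CLIP-style zero-shot teacher (the spirit of CMKD) and a sparse-residual compression step (the spirit of RST). All function names, the logit scale, the temperature, and the keep ratio are illustrative assumptions rather than the paper's exact formulation; see the repository above for the authors' implementation.

```python
# Minimal, hypothetical sketch of the abstract's two ideas -- not the paper's exact method.
import torch
import torch.nn.functional as F


def cross_modal_distillation_loss(student_logits, teacher_image_feats,
                                  teacher_text_feats, temperature=2.0):
    """KL distillation from a VLP zero-shot teacher to the task model (CMKD-style sketch)."""
    # Zero-shot teacher logits: scaled cosine similarity between image embeddings
    # and class-prompt text embeddings (e.g. "a photo of a <class>").
    img = F.normalize(teacher_image_feats, dim=-1)   # (B, D)
    txt = F.normalize(teacher_text_feats, dim=-1)    # (C, D)
    teacher_logits = 100.0 * img @ txt.t()           # (B, C), CLIP-style logit scale

    # Soft-label distillation on unlabeled target-domain images.
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def residual_sparse_update(pretrained_state, finetuned_state, keep_ratio=0.005):
    """RST-style compression: keep only the largest weight residuals (~0.1%-0.5%)."""
    sparse_residuals = {}
    for name, w0 in pretrained_state.items():
        delta = finetuned_state[name].float() - w0.float()
        k = max(1, int(keep_ratio * delta.numel()))
        threshold = delta.abs().flatten().topk(k).values.min()
        mask = delta.abs() >= threshold
        # Store the sparse delta; the task model is later rebuilt as w0 + delta.
        sparse_residuals[name] = (delta * mask).to_sparse()
    return sparse_residuals
```

In this reading, each additional transfer task only needs to ship a sparse residual dictionary on top of the shared VLP checkpoint, which lines up with the storage savings the abstract claims.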
Related papers
- Meta-Learning Adaptable Foundation Models [37.458141335750696]
We introduce a meta-learning framework infused with parameter-efficient fine-tuning (PEFT) in an intermediate retraining stage to learn a model that can be easily adapted to unseen tasks.
In this setting, we demonstrate the suboptimality of standard retraining for finding an adaptable set of parameters.
We then apply these theoretical insights to retraining the RoBERTa model to predict the continuation of conversations within the ConvAI2 dataset.
arXiv Detail & Related papers (2024-10-29T17:24:18Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition [72.35438297011176]
We propose a novel method to realize seamless adaptation of pre-trained models for visual place recognition (VPR).
Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method.
Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time.
arXiv Detail & Related papers (2024-02-22T12:55:01Z)
- FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning [21.693779973263172]
In this paper, we introduce a fine-tuning approach termed Feature Discrimination Alignment (FD-Align).
Our method aims to bolster the model's generalizability by preserving the consistency of spurious features.
Once fine-tuned, the model can seamlessly integrate with existing methods, leading to performance improvements.
arXiv Detail & Related papers (2023-10-23T17:12:01Z)
- Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z)
- Open-Set Domain Adaptation with Visual-Language Foundation Models [51.49854335102149]
Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge from a source domain to a target domain with unlabeled data.
Open-set domain adaptation (ODA) has emerged as a potential solution to identify target-domain classes absent from the source domain during the training phase.
arXiv Detail & Related papers (2023-07-30T11:38:46Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Guiding The Last Layer in Federated Learning with Pre-Trained Models [18.382057374270143]
Federated Learning (FL) is an emerging paradigm that allows a model to be trained across a number of participants without sharing data.
We show that fitting a classification head using the Nearest Class Means (NCM) can be done exactly and orders of magnitude more efficiently than existing proposals.
arXiv Detail & Related papers (2023-06-06T18:02:02Z)
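The NCM head referenced in the entry above is simple enough to sketch in a few lines: each class is represented by the mean embedding of its training examples, and those means serve directly as the weights of the final classification layer. The snippet below is a generic, hypothetical illustration of that idea, not the cited paper's federated procedure.

```python
import torch
import torch.nn.functional as F


def fit_ncm_head(features, labels, num_classes):
    """Fit a classifier head from Nearest Class Means: one pass, no gradient steps.

    features: (N, D) embeddings from a frozen pre-trained backbone.
    labels:   (N,)   integer class labels in [0, num_classes).
    Returns a (num_classes, D) weight matrix whose rows are normalized class means.
    """
    means = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        means[c] = features[labels == c].mean(dim=0)  # assumes every class is present
    # Normalized means turn a plain dot product into a cosine-similarity classifier.
    return F.normalize(means, dim=-1)
```

Predictions are then just a dot product between embeddings and these rows, which is in line with the entry's claim that the head can be fit exactly and far more efficiently than by iterative training.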
- ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for E-Commerce Product Search [4.220439000486713]
We propose a robust multilingual model to improve the quality of search results.
In the pre-training stage, we adopt masked language modeling (MLM), classification, and contrastive learning tasks.
In the fine-tuning stage, we use confident learning, the exponential moving average (EMA) method, adversarial training (FGM), and the regularized dropout strategy (R-Drop).
arXiv Detail & Related papers (2023-01-31T07:31:34Z)
- DL-DRL: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-UAV [65.07776277630228]
We propose a double-level deep reinforcement learning (DL-DRL) approach based on a divide and conquer framework (DCF).
Particularly, we design an encoder-decoder structured policy network in our upper-level DRL model to allocate the tasks to different UAVs.
We also exploit another attention-based policy network in our lower-level DRL model to construct the route for each UAV, with the objective of maximizing the number of executed tasks.
arXiv Detail & Related papers (2022-08-04T04:35:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.