Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning
- URL: http://arxiv.org/abs/2312.02021v3
- Date: Wed, 30 Oct 2024 22:58:36 GMT
- Title: Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning
- Authors: Christoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk
- Abstract summary: Fine-tuning vision-language pre-trained models yields competitive or even stronger generalization results.
This challenges the standard of using ImageNet-based transfer learning for domain generalization.
We also find improved in-domain generalization, leading to a new SOTA of 86.4% mIoU on the Cityscapes test set.
- Score: 6.532114018212791
- Abstract: Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNNs), where domain shifts occur due to synthetic data, lighting, weather, or location changes. Vision-language models (VLMs) marked a large step forward in generalization capabilities and have already been applied to various tasks. Very recently, the first approaches utilized VLMs for domain generalized segmentation and object detection and obtained strong generalization. However, all of these approaches rely on complex modules, feature augmentation frameworks, or additional models. Surprisingly, and in contrast, we found that simple fine-tuning of vision-language pre-trained models yields competitive or even stronger generalization results while being extremely simple to apply. Moreover, we found that vision-language pre-training consistently provides better generalization than the previous standard of vision-only pre-training. This challenges the standard of using ImageNet-based transfer learning for domain generalization. Fully fine-tuning a vision-language pre-trained model is capable of reaching the domain generalization SOTA when training on the synthetic GTA5 dataset. Moreover, we confirm this observation for object detection on a novel synthetic-to-real benchmark. We further obtain superior generalization capabilities, reaching 77.9% mIoU on the popular Cityscapes-to-ACDC benchmark. We also found improved in-domain generalization, leading to an improved SOTA of 86.4% mIoU on the Cityscapes test set, marking first place on the leaderboard.
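The "simple fine-tuning" finding can be illustrated with a toy linear model (an illustration only, not the paper's actual setup): full fine-tuning updates the pre-trained encoder jointly with the task head instead of freezing it, which lets the backbone adapt to the downstream objective.

```python
import numpy as np

# Toy sketch of full fine-tuning: a "pre-trained" linear encoder and a fresh
# task head are both updated by gradient descent on the downstream loss.
rng = np.random.default_rng(0)

X = rng.normal(size=(64, 16))             # inputs
W_true = rng.normal(size=(16, 4))
Y = X @ W_true                            # regression targets

W_enc = rng.normal(size=(16, 8)) * 0.5    # "pre-trained" encoder weights
W_head = np.zeros((8, 4))                 # task head, trained from scratch

def mse(W_e, W_h):
    return float(np.mean((X @ W_e @ W_h - Y) ** 2))

loss0 = mse(W_enc, W_head)
lr = 0.01
for _ in range(500):
    E = X @ W_enc                         # encoder features
    R = E @ W_head - Y                    # prediction residual
    g_head = E.T @ R / len(X)             # gradient w.r.t. the head
    g_enc = X.T @ (R @ W_head.T) / len(X) # gradient w.r.t. the encoder
    W_head -= lr * g_head
    W_enc -= lr * g_enc                   # full fine-tuning: encoder moves too
```

Freezing `W_enc` (a linear probe) would skip the last update line; fine-tuning both factors is what the paper reports as a surprisingly strong baseline.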
Related papers
- A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models [1.2499537119440245]
Domain shifts are one of the major challenges in deep learning based computer vision.
Unsupervised domain adaptation (UDA) methods have emerged that adapt a model to a new target domain using only unlabeled data from that domain.
Recent vision-language models have demonstrated strong generalization capabilities which may facilitate domain adaptation.
We show that replacing the encoder of existing UDA methods by a vision-language pre-trained encoder can result in significant performance improvements.
arXiv Detail & Related papers (2024-11-25T14:12:24Z) - Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
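One way to sketch such a weak-to-strong supervision loss (the paper's exact formulation is not reproduced here; this is an assumed confidence-weighting scheme) is to train the strong student on the weak teacher's soft labels while down-weighting samples where the teacher is unconfident:

```python
import numpy as np

# Hypothetical confidence-weighted weak-to-strong loss: soft cross-entropy
# between student log-probabilities and weak-teacher probabilities, scaled
# per sample by the teacher's confidence (its max class probability).
def adaptive_w2s_loss(student_logp, weak_probs):
    """student_logp, weak_probs: arrays of shape (N, K)."""
    conf = weak_probs.max(axis=1)                  # weak-teacher confidence
    ce = -(weak_probs * student_logp).sum(axis=1)  # soft cross-entropy per sample
    return float((conf * ce).mean())

weak = np.array([[0.9, 0.1],     # confident teacher prediction
                 [0.5, 0.5]])    # uncertain teacher prediction
logp = np.log(np.array([[0.8, 0.2],
                        [0.6, 0.4]]))
loss = adaptive_w2s_loss(logp, weak)
```

The uncertain second sample contributes with half the weight of a fully confident one, so the student is pulled less strongly toward labels the weak model itself doubts.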
arXiv Detail & Related papers (2024-02-06T06:30:34Z) - Multi-Scale and Multi-Layer Contrastive Learning for Domain Generalization [5.124256074746721]
We argue that the generalization ability of deep convolutional neural networks can be improved by taking advantage of multi-layer and multi-scale representations of the network.
We introduce a framework that aims at improving domain generalization of image classifiers by combining both low-level and high-level features at multiple scales.
We show that our model is able to surpass the performance of previous DG methods and consistently produce competitive and state-of-the-art results in all datasets.
arXiv Detail & Related papers (2023-08-28T08:54:27Z) - Augmentation-based Domain Generalization for Semantic Segmentation [2.179313476241343]
Unsupervised Domain Adaptation (UDA) and domain generalization (DG) aim to tackle the lack of generalization of Deep Neural Networks (DNNs) towards unseen domains.
We study the in- and out-of-domain generalization capabilities of simple, rule-based image augmentations like blur, noise, color jitter and many more.
Our experiments confirm the common finding that combining multiple different augmentations outperforms single augmentations.
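The rule-based augmentations studied here compose straightforwardly; a minimal sketch (function names and parameters are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
img = rng.random((3, 24, 24))  # CHW image with values in [0, 1]

def blur(x):
    """3x3 box blur per channel via edge padding and window averaging."""
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    h, w = x.shape[1:]
    return sum(p[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9

def noise(x, s=0.05):
    """Additive Gaussian pixel noise, clipped back to the valid range."""
    return np.clip(x + s * rng.standard_normal(x.shape), 0.0, 1.0)

def jitter(x, b=0.2):
    """Random global brightness scaling in [1 - b/2, 1 + b/2]."""
    return np.clip(x * (1 + b * (rng.random() - 0.5)), 0.0, 1.0)

def compose(x, ops):
    for op in ops:
        x = op(x)
    return x

aug = compose(img, [blur, noise, jitter])
```

Sampling a random subset and order of `ops` per image would mirror the combined-augmentation policies whose benefit the paper confirms.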
arXiv Detail & Related papers (2023-04-24T14:26:53Z) - TFS-ViT: Token-Level Feature Stylization for Domain Generalization [17.82872117103924]
Vision Transformers (ViTs) have shown outstanding performance for a broad range of computer vision tasks.
This paper presents the first Token-level Feature Stylization (TFS-ViT) approach for domain generalization.
Our approach transforms token features by mixing the normalization statistics of images from different domains.
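The core operation, mixing normalization statistics across domains, can be sketched as follows (an AdaIN-style interpretation of the blurb; the exact TFS-ViT formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
tokens_a = rng.normal(loc=2.0, scale=3.0, size=(196, 64))   # ViT tokens, domain A
tokens_b = rng.normal(loc=-1.0, scale=0.5, size=(196, 64))  # ViT tokens, domain B

def stylize(t_src, t_ref, lam=0.5, eps=1e-6):
    """Shift t_src's per-channel token statistics toward t_ref's."""
    mu_s, sig_s = t_src.mean(0), t_src.std(0) + eps
    mu_r, sig_r = t_ref.mean(0), t_ref.std(0) + eps
    mu_mix = lam * mu_s + (1 - lam) * mu_r    # interpolated mean
    sig_mix = lam * sig_s + (1 - lam) * sig_r # interpolated std
    # Standardize with source stats, re-scale with mixed stats.
    return sig_mix * (t_src - mu_s) / sig_s + mu_mix

styled = stylize(tokens_a, tokens_b, lam=0.3)
```

The stylized tokens keep domain A's content (the per-token deviations) while adopting statistics interpolated between the two domains, simulating a domain shift during training.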
arXiv Detail & Related papers (2023-03-28T03:00:28Z) - When Neural Networks Fail to Generalize? A Model Sensitivity Perspective [82.36758565781153]
Domain generalization (DG) aims to train a model to perform well in unseen domains under different distributions.
This paper considers a more realistic yet more challenging scenario, namely Single Domain Generalization (Single-DG).
We empirically identify a property of a model that correlates strongly with its generalization, which we coin "model sensitivity".
We propose a novel strategy of Spectral Adversarial Data Augmentation (SADA) to generate augmented images targeted at the highly sensitive frequencies.
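A rough sketch of such a spectral augmentation (the radial-band targeting below is an assumption; SADA selects the frequencies a given model is sensitive to):

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.random((32, 32))  # toy grayscale image

def spectral_augment(x, cutoff=0.25, strength=0.3):
    """Perturb amplitudes above a radial frequency cutoff, keeping phase."""
    F = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)  # normalized radius
    mask = r > cutoff                                       # high-frequency band
    amp, phase = np.abs(F), np.angle(F)
    amp[mask] *= 1 + strength * rng.standard_normal(int(mask.sum()))
    # Recombine and take the real part of the inverse transform.
    return np.real(np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * phase))))

aug = spectral_augment(img)
```

Because the DC component is left untouched, the global brightness of the image is preserved while high-frequency texture, which many models over-rely on, is perturbed.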
arXiv Detail & Related papers (2022-12-01T20:15:15Z) - Improving Visual Grounding by Encouraging Consistent Gradient-based
Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces superior visual grounding results compared to previous methods.
AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z) - Towards Principled Disentanglement for Domain Generalization [90.9891372499545]
A fundamental challenge for machine learning models is generalizing to out-of-distribution (OOD) data.
We first formalize the OOD generalization problem as constrained optimization, called Disentanglement-constrained Domain Generalization (DDG).
Based on the transformation, we propose a primal-dual algorithm for joint representation disentanglement and domain generalization.
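A generic primal-dual scheme of this kind alternates gradient descent on the primal variables with projected gradient ascent on the Lagrange multipliers. The following is a schematic on a scalar problem, not DDG itself:

```python
# Schematic primal-dual updates for  min_x f(x)  s.t.  g(x) <= 0,
# via the Lagrangian L(x, lam) = f(x) + lam * g(x).
def f(x):
    return (x - 3.0) ** 2   # objective: unconstrained minimum at x = 3

def g(x):
    return x - 1.0          # constraint: x <= 1

x, lam, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    grad_x = 2 * (x - 3.0) + lam       # d/dx [f(x) + lam * g(x)]
    x -= lr * grad_x                   # primal descent step
    lam = max(0.0, lam + lr * g(x))    # dual ascent, projected to lam >= 0
```

The iterates settle at the constrained optimum x = 1 with multiplier lam = 4, satisfying the KKT condition 2(x - 3) + lam = 0; DDG applies the same alternation to the disentanglement constraint and the generalization objective.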
arXiv Detail & Related papers (2021-11-27T07:36:32Z) - Adversarially Adaptive Normalization for Single Domain Generalization [71.80587939738672]
We propose a generic normalization approach, adaptive standardization and rescaling normalization (ASR-Norm).
ASR-Norm learns both the standardization and rescaling statistics via neural networks.
We show that ASR-Norm can bring consistent improvement to the state-of-the-art ADA approaches.
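A minimal sketch of the idea (illustrative only; the real ASR-Norm learns these mappings end to end inside the network): predict standardization statistics from the observed instance statistics with a small network, then blend them with the observed ones via a gate.

```python
import numpy as np

rng = np.random.default_rng(3)
C = 8
# Stand-ins for learned parameters: linear maps over channel statistics
# and a blending gate (sigmoid of a zero-initialized logit = 0.5).
W_mu = rng.normal(size=(C, C)) * 0.1
W_sig = rng.normal(size=(C, C)) * 0.1
gate = 1.0 / (1.0 + np.exp(-np.zeros(C)))

def asr_norm(x, eps=1e-5):
    """x: (C, H, W) feature map; adaptively standardized output."""
    mu = x.mean(axis=(1, 2))              # observed per-channel mean
    sig = x.std(axis=(1, 2)) + eps        # observed per-channel std
    mu_hat = W_mu @ mu                    # "learned" mean statistics
    sig_hat = np.abs(W_sig @ sig) + eps   # "learned" std statistics (kept positive)
    mu_b = gate * mu + (1 - gate) * mu_hat    # gated blend of observed/learned
    sig_b = gate * sig + (1 - gate) * sig_hat
    return (x - mu_b[:, None, None]) / sig_b[:, None, None]

feat = rng.normal(size=(C, 16, 16))
out = asr_norm(feat)
```

The gate lets the layer interpolate between plain instance normalization (gate near 1) and fully learned statistics (gate near 0), which is the adaptivity the abstract refers to.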
arXiv Detail & Related papers (2021-06-01T23:58:23Z) - Learning Meta Face Recognition in Unseen Domains [74.69681594452125]
We propose a novel face recognition method via meta-learning named Meta Face Recognition (MFR).
MFR synthesizes the source/target domain shift with a meta-optimization objective.
We propose two benchmarks for generalized face recognition evaluation.
arXiv Detail & Related papers (2020-03-17T14:10:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.