ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
- URL: http://arxiv.org/abs/2503.09399v1
- Date: Wed, 12 Mar 2025 13:49:45 GMT
- Title: ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
- Authors: Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel,
- Abstract summary: Transformers have achieved state-of-the-art performance in large-scale image classification.<n>They often require large amounts of data and can exhibit biases that limit their robustness and generalizability.<n>This paper introduces ForAug, a novel data augmentation scheme that explicitly includes inductive biases.
- Score: 7.242733423663421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.
Related papers
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z) - A Comprehensive Study of Image Classification Model Sensitivity to
Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We call this dataset RIVAL10 consisting of roughly $26k$ instances over $10$ classes.
We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds and attributes.
In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training)
arXiv Detail & Related papers (2022-01-26T06:31:28Z) - Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) is emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses that our method is superior to other methods in that it can capture both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z) - Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [29.49873710927313]
We consider a self-supervised pre-training scenario that only leverages the target task data.
Our study shows that denoising autoencoders, such as BEiT, are more robust to the type and size of the pre-training data.
On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses the supervised ImageNet pre-training in a comparable setting.
arXiv Detail & Related papers (2021-12-20T18:41:32Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID)
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs)
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z) - When Vision Transformers Outperform ResNets without Pretraining or
Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and existing VisionNets signal efforts on replacing hand-wired features or inductive throughputs with general-purpose neural architectures.
This paper investigates ViTs and Res-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness attributes to sparser active neurons in the first few layers.
The resultant ViTs outperform Nets of similar size and smoothness when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z) - Automated Cleanup of the ImageNet Dataset by Model Consensus,
Explainability and Confident Learning [0.0]
ImageNet was the backbone of various convolutional neural networks (CNNs) trained on ILSVRC12Net.
This paper describes automated applications based on model consensus, explainability and confident learning to correct labeling mistakes.
The ImageNet-Clean improves the model performance by 2-2.4 % for SqueezeNet and EfficientNet-B0 models.
arXiv Detail & Related papers (2021-03-30T13:16:35Z) - Increasing the Robustness of Semantic Segmentation Models with
Painting-by-Numbers [39.95214171175713]
We build upon an insight from image classification that output can be improved by increasing the network-bias towards object shapes.
Our basic idea is to alpha-blend a portion of the RGB training images with faked images, where each class-label is given a fixed, randomly chosen color.
We demonstrate the effectiveness of our training schema for DeepLabv3+ with various network backbones, MobileNet-V2, ResNets, and Xception, and evaluate it on the Cityscapes dataset.
arXiv Detail & Related papers (2020-10-12T07:42:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.