Finding Differences Between Transformers and ConvNets Using
Counterfactual Simulation Testing
- URL: http://arxiv.org/abs/2211.16499v1
- Date: Tue, 29 Nov 2022 18:59:23 GMT
- Title: Finding Differences Between Transformers and ConvNets Using
Counterfactual Simulation Testing
- Authors: Nataniel Ruiz, Sarah Adel Bargal, Cihang Xie, Kate Saenko, Stan
Sclaroff
- Abstract summary: We present a counterfactual framework that allows us to study the robustness of neural networks with respect to naturalistic variations.
Our method allows for a fair comparison of the robustness of recently released, state-of-the-art Convolutional Neural Networks and Vision Transformers.
- Score: 82.67716657524251
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern deep neural networks tend to be evaluated on static test sets. One
shortcoming of this is the fact that these deep neural networks cannot be
easily evaluated for robustness issues with respect to specific scene
variations. For example, it is hard to study the robustness of these networks
to variations of object scale, object pose, scene lighting and 3D occlusions.
The main reason is that collecting real datasets with fine-grained naturalistic
variations of sufficient scale can be extremely time-consuming and expensive.
In this work, we present Counterfactual Simulation Testing, a counterfactual
framework that allows us to study the robustness of neural networks with
respect to some of these naturalistic variations by building realistic
synthetic scenes that allow us to ask counterfactual questions to the models,
ultimately providing answers to questions such as "Would your classification
still be correct if the object were viewed from the top?" or "Would your
classification still be correct if the object were partially occluded by
another object?". Our method allows for a fair comparison of the robustness of
recently released, state-of-the-art Convolutional Neural Networks and Vision
Transformers, with respect to these naturalistic variations. We find evidence
that ConvNext is more robust to pose and scale variations than Swin, that
ConvNext generalizes better to our simulated domain and that Swin handles
partial occlusion better than ConvNext. We also find that robustness for all
networks improves with network scale and with data scale and variety. We
release the Naturalistic Variation Object Dataset (NVD), a large simulated
dataset of 272k images of everyday objects with naturalistic variations such as
object pose, scale, viewpoint, lighting and occlusions. Project page:
https://counterfactualsimulation.github.io
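The testing loop the abstract describes can be summarized in a few lines: render counterfactual variants of a scene, re-run the classifier on each, and measure how often the prediction stays correct. Below is a minimal sketch of that idea; the `render` and `classify` callables and the `Variation` record are hypothetical stand-ins, not the authors' released code.

```python
# A minimal sketch of counterfactual simulation testing, assuming
# hypothetical renderer/classifier interfaces (not the paper's code).
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Variation:
    """One naturalistic variation of a base scene."""
    kind: str    # e.g. "pose" | "scale" | "lighting" | "occlusion"
    params: dict # e.g. {"elevation_deg": 80} for a top-down viewpoint

def counterfactual_accuracy(
    classify: Callable,            # image -> predicted class label
    render: Callable,              # (scene, Variation) -> image
    scene,                         # synthetic scene with a known object
    true_label: str,
    variations: Iterable[Variation],
) -> float:
    """Fraction of counterfactual renders on which the prediction stays correct.

    This operationalizes questions like "Would your classification still
    be correct if the object were viewed from the top?" by rendering that
    counterfactual scene and re-running the model on it.
    """
    outcomes = [classify(render(scene, v)) == true_label for v in variations]
    return sum(outcomes) / max(len(outcomes), 1)
```

Averaging this score over objects, restricted to one kind of variation (say, pose), would give a per-network robustness number, so a ConvNet and a Vision Transformer can be compared under identical counterfactuals.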
Related papers
- The Change You Want to See (Now in 3D) [65.61789642291636]
The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene.
We contribute a change detection model that is trained entirely on synthetic data and is class-agnostic.
We release a new evaluation dataset consisting of real-world image pairs with human-annotated differences.
arXiv Detail & Related papers (2023-08-21T01:59:45Z)
- D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field [16.301611237147863]
We propose replacing the implicit value with an adaptive uncertainty distribution, to differentiate between points based on their distance to the surface.
This simple "value to distribution" transition yields significant improvements on nearly all the baselines.
Results demonstrate that the models trained using our uncertainty distribution loss, can capture more intricate wrinkles, and realistic limbs.
arXiv Detail & Related papers (2023-08-17T08:31:11Z)
- Capsules as viewpoint learners for human pose estimation [4.246061945756033]
We show how most neural networks are not able to generalize well when the camera is subject to significant viewpoint changes.
We propose a novel end-to-end viewpoint-equivariant capsule autoencoder that employs a fast Variational Bayes routing and matrix capsules.
We achieve state-of-the-art results for multiple tasks and datasets while retaining other desirable properties.
arXiv Detail & Related papers (2023-02-13T09:01:46Z)
- A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We introduce a dataset, RIVAL10, consisting of roughly 26k instances over 10 classes.
We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds and attributes.
In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training).
arXiv Detail & Related papers (2022-01-26T06:31:28Z)
- Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z)
- Learning Online Visual Invariances for Novel Objects via Supervised and Self-Supervised Training [0.76146285961466]
This paper assesses whether standard CNNs can support human-like online invariance by training models to recognize images of synthetic 3D objects that undergo several transformations.
We show that standard supervised CNNs trained on transformed objects can acquire strong invariances on novel classes even when trained with as few as 50 objects taken from 10 classes.
arXiv Detail & Related papers (2021-10-04T14:29:43Z)
- Contemplating real-world object classification [53.10151901863263]
We reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations.
We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement.
arXiv Detail & Related papers (2021-03-08T23:29:59Z)
- 6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference [67.70859730448473]
We present a multimodal camera relocalization framework that captures ambiguities and uncertainties.
We predict multiple camera pose hypotheses as well as the respective uncertainty for each prediction.
We introduce a new dataset specifically designed to foster camera localization research in ambiguous environments.
arXiv Detail & Related papers (2020-04-09T20:55:06Z)
- Virtual to Real adaptation of Pedestrian Detectors [9.432150710329607]
ViPeD is a new synthetically generated set of images collected with the graphics engine of the video game Grand Theft Auto V (GTA V).
We propose two different Domain Adaptation techniques suitable for the pedestrian detection task, but possibly applicable to general object detection.
Experiments show that the network trained with ViPeD can generalize over unseen real-world scenarios better than the detector trained over real-world data.
arXiv Detail & Related papers (2020-01-09T14:50:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.