The Role of ImageNet Classes in Fréchet Inception Distance
- URL: http://arxiv.org/abs/2203.06026v1
- Date: Fri, 11 Mar 2022 15:50:06 GMT
- Title: The Role of ImageNet Classes in Fréchet Inception Distance
- Authors: Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, Jaakko Lehtinen
- Abstract summary: Fréchet Inception Distance (FID) is a metric for quantifying the distance between two distributions of images.
We observe that FID is essentially a distance between sets of ImageNet class probabilities.
Our results suggest caution against over-interpreting FID improvements, and underline the need for distribution metrics that are more perceptually uniform.
- Score: 33.47601032254247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fréchet Inception Distance (FID) is a metric for quantifying the distance
between two distributions of images. Given its status as a standard yardstick
for ranking models in data-driven generative modeling research, it seems
important that the distance is computed from general, "vision-related"
features. But is it? We observe that FID is essentially a distance between sets
of ImageNet class probabilities. We trace the reason to the fact that the
standard feature space, the penultimate "pre-logit" layer of a particular
Inception-V3 classifier network, is only one affine transform away from the
logits, i.e., ImageNet classes, and thus, the features are necessarily highly
specialized to them. This has unintuitive consequences for the metric's
sensitivity. For example, when evaluating a model for human faces, we observe
that, on average, FID is actually very insensitive to the facial region, and
that the probabilities of classes like "bow tie" or "seat belt" play a much
larger role. Further, we show that FID can be significantly reduced -- without
actually improving the quality of results -- by an attack that first generates
a slightly larger set of candidates, and then chooses a subset that happens to
match the histogram of such "fringe features" in the real data. We then
demonstrate that this observation has practical relevance in case of ImageNet
pre-training of GANs, where a part of the observed FID improvement turns out
not to be real. Our results suggest caution against over-interpreting FID
improvements, and underline the need for distribution metrics that are more
perceptually uniform.
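To make the central observation concrete, here is a minimal sketch of the standard FID pipeline, assuming PyTorch, torchvision, NumPy, and SciPy. Torchvision's Inception-V3 checkpoint is used as a stand-in for the specific TensorFlow-ported weights that standard FID implementations ship with, and preprocessing is abbreviated; the point is that the 2048-dimensional "pre-logit" features feeding the Fréchet distance are separated from the 1000 ImageNet class logits by nothing more than the final fully connected (affine) layer.
```python
import numpy as np
import torch
import torchvision
from scipy import linalg

# Inception-V3 classifier. The standard FID implementation uses a specific
# TF-ported checkpoint; torchvision's ImageNet weights are only a stand-in here.
model = torchvision.models.inception_v3(weights="IMAGENET1K_V1").eval()

# The final fully connected layer is the single affine map (2048 -> 1000) that
# turns pre-logit features into ImageNet class logits.
fc = model.fc
model.fc = torch.nn.Identity()  # model(x) now returns the 2048-d pre-logit features


def prelogit_features(images: torch.Tensor) -> np.ndarray:
    """Pre-logit features for a batch of 299x299, ImageNet-normalized images."""
    with torch.no_grad():
        return model(images).numpy()  # shape (N, 2048)


def imagenet_probs(feats: np.ndarray) -> np.ndarray:
    """ImageNet class probabilities: softmax of the affine map over the features."""
    with torch.no_grad():
        return torch.softmax(fc(torch.from_numpy(feats)), dim=1).numpy()  # (N, 1000)


def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to the two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean.real))
```
Because `fc` is all that separates the feature space from the ImageNet classes, any change in generated images that shifts the resulting class probabilities, including fringe classes such as "bow tie" or "seat belt", moves the FID regardless of whether the change is perceptually meaningful.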
Related papers
- Normalizing Flow-Based Metric for Image Generation [4.093503153499691]
We propose two new evaluation metrics to assess realness of generated images based on normalizing flows.
Because normalizing flows can be used to compute the exact likelihood, the proposed metrics assess how closely generated images align with the distribution of real images from a given domain.
arXiv Detail & Related papers (2024-10-02T20:09:58Z) - Rethinking FID: Towards a Better Evaluation Metric for Image Generation [43.66036053597747]
Fréchet Inception Distance (FID) estimates the distance between the distribution of Inception-v3 features of real images and that of images generated by the algorithm.
We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity.
We propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel; a minimal MMD sketch appears after this list.
arXiv Detail & Related papers (2023-11-30T19:11:01Z) - Using Skew to Assess the Quality of GAN-generated Image Features [3.300324211572204]
The Fréchet Inception Distance (FID) has been widely adopted due to its conceptual simplicity, fast computation time, and strong correlation with human perception.
In this paper we explore the importance of third moments in image feature data and use this information to define a new measure, which we call the Skew Inception Distance (SID).
arXiv Detail & Related papers (2023-10-31T17:05:02Z) - Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts [22.74552390076515]
We probe the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones and pretraining sets.
We detect the presence of outlier features in robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these are observed in non-transformer models.
We find the existence of outlier features to be an indication of ImageNet shift robustness in models, since we only find them in robust models in our analysis.
arXiv Detail & Related papers (2023-10-19T17:59:12Z) - ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - Scale Attention for Learning Deep Face Representation: A Study Against
Visual Scale Variation [69.45176408639483]
We reform the conv layer by resorting to the scale-space theory.
We build a novel style named SCale AttentioN Conv Neural Network (SCAN-CNN).
As a single-shot scheme, the inference is more efficient than multi-shot fusion.
arXiv Detail & Related papers (2022-09-19T06:35:04Z) - Visual Recognition with Deep Nearest Centroids [57.35144702563746]
We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition.
Compared with parametric counterparts, DNC performs better on image classification (CIFAR-10, ImageNet) and greatly boosts pixel recognition (ADE20K, Cityscapes).
arXiv Detail & Related papers (2022-09-15T15:47:31Z) - Glance and Focus Networks for Dynamic Visual Recognition [36.26856080976052]
We formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system.
The proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features.
It reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy.
arXiv Detail & Related papers (2022-01-09T14:00:56Z) - Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z) - Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance
Disparity Estimation [51.17232267143098]
We propose a novel system named Disp R-CNN for 3D object detection from stereo images.
We use a statistical shape model to generate dense disparity pseudo-ground-truth without the need of LiDAR point clouds.
Experiments on the KITTI dataset show that, even when LiDAR ground-truth is not available at training time, Disp R-CNN achieves competitive performance and outperforms previous state-of-the-art methods by 20% in terms of average precision.
arXiv Detail & Related papers (2020-04-07T17:48:45Z)
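As a companion to the "Rethinking FID" entry above, the following is a minimal sketch of the maximum mean discrepancy with a Gaussian RBF kernel over embedding arrays (for example, CLIP image embeddings); the bandwidth value and the biased V-statistic estimator are simplifying assumptions for illustration, not CMMD's exact settings.
```python
import numpy as np


def gaussian_rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 10.0) -> float:
    """Squared MMD between sample sets x (n, d) and y (m, d) under an RBF kernel.

    sigma is an illustrative bandwidth; the biased V-statistic estimator is
    used here for brevity.
    """
    def kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # ||a_i - b_j||^2 via the expansion, avoiding an (n, m, d) intermediate.
        sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())


# Stand-in arrays in place of real vs. generated image embeddings.
real = np.random.randn(256, 512)
fake = np.random.randn(256, 512) + 0.1
print(gaussian_rbf_mmd2(real, fake))
```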