Does Progress On Object Recognition Benchmarks Improve Real-World
Generalization?
- URL: http://arxiv.org/abs/2307.13136v1
- Date: Mon, 24 Jul 2023 21:29:48 GMT
- Title: Does Progress On Object Recognition Benchmarks Improve Real-World
Generalization?
- Authors: Megan Richards, Polina Kirichenko, Diane Bouchacourt, Mark Ibrahim
- Abstract summary: Researchers have measured progress in object recognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C, and -R for more than a decade.
Recent foundation models, trained on orders of magnitude more data, have begun to saturate these standard benchmarks, yet remain brittle in practice.
We propose studying generalization across geography as a more realistic measure of progress using two datasets of objects from households across the globe.
- Score: 9.906591021385303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For more than a decade, researchers have measured progress in object
recognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C,
and -R. Recent foundation models, trained on orders of magnitude more data,
have begun to saturate these standard benchmarks, yet remain brittle in
practice. This suggests that standard benchmarks, which tend to focus on
predefined or synthetic changes, may not be sufficient for measuring real-world
generalization. Consequently, we propose studying generalization across
geography as a more realistic measure of progress using two datasets of objects
from households across the globe. We conduct an extensive empirical evaluation
of progress across nearly 100 vision models, up to the most recent foundation
models. We first identify a progress gap between standard benchmarks and
real-world, geographic shifts: progress on ImageNet yields up to 2.5x more
progress on standard generalization benchmarks than on real-world distribution
shifts. Second, we study model generalization across geographies
by measuring the disparities in performance across regions, a more fine-grained
measure of real-world generalization. We observe that all models exhibit large
geographic disparities, even foundation models such as CLIP, with differences
of 7-20% in accuracy between regions. Counter to modern intuition, we find that
progress on standard benchmarks fails to reduce geographic disparities and
often exacerbates them: geographic disparity has more than tripled from the
least performant models to today's best models. Our results suggest scaling
alone is insufficient for consistent robustness to real-world distribution
shifts. Finally, we show in early experiments how simple last-layer retraining
on more representative, curated data can complement scaling as a
promising direction of future work, reducing geographic disparity on both
benchmarks by over two-thirds.
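As a concrete illustration of the two measurements above, here is a minimal, hedged sketch of computing per-region accuracy disparity and of last-layer retraining with a frozen backbone. The arrays `features`, `labels`, and `regions`, and the scikit-learn linear head, are illustrative assumptions rather than the authors' released code.

```python
# Hypothetical sketch: `preds`, `labels`, and `regions` are assumed NumPy
# arrays from a geographically diverse household-object dataset; the linear
# head stands in for "last-layer retraining" on frozen backbone features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def geographic_disparity(preds, labels, regions):
    # Per-region accuracy, then the gap between best and worst regions.
    accs = {r: float(np.mean(preds[regions == r] == labels[regions == r]))
            for r in np.unique(regions)}
    return max(accs.values()) - min(accs.values()), accs

def last_layer_retrain(curated_feats, curated_labels, eval_feats):
    # Keep the backbone frozen; refit only a linear classification head
    # on more representative, curated examples.
    head = LogisticRegression(max_iter=1000).fit(curated_feats, curated_labels)
    return head.predict(eval_feats)
```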
Related papers
- Diverse Perspectives, Divergent Models: Cross-Cultural Evaluation of Depression Detection on Twitter [4.462334751640166]
We evaluate how well models built on benchmark datasets generalize to cross-cultural Twitter data.
Our results show that depression detection models do not generalize globally.
Pre-trained language models generalize best, outperforming logistic regression, though they still show significant performance gaps on depressed and non-Western users.
arXiv Detail & Related papers (2024-04-01T03:59:12Z)
- Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning [6.532114018212791]
Fine-tuning vision-language pre-trained models yields competitive or even stronger generalization results.
This challenges the standard practice of using ImageNet-based transfer learning for domain generalization.
We also find improved in-domain generalization, leading to an improved SOTA of 86.4% mIoU on the Cityscapes test set.
arXiv Detail & Related papers (2023-12-04T16:46:38Z)
- Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling-law relationship depends on performance metrics that may not correspond with how different groups of people perceive the quality of model output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z)
- GeoNet: Benchmarking Unsupervised Adaptation across Geographies [71.23141626803287]
We study the problem of geographic robustness and make three main contributions.
First, we introduce GeoNet, a large-scale dataset for geographic adaptation.
Second, we hypothesize that the major source of domain shift arises from significant variations in scene context.
Third, we conduct an extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures.
arXiv Detail & Related papers (2023-03-27T17:59:34Z)
- Fairness meets Cross-Domain Learning: a new perspective on Models and Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness.
We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks.
Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z)
- Image Classification with Small Datasets: Overview and Benchmark [0.0]
We systematically organize and connect past studies to consolidate a community that is currently fragmented and scattered.
We propose a common benchmark that allows for an objective comparison of approaches.
We use this benchmark to re-evaluate the standard cross-entropy baseline and ten existing methods published between 2017 and 2021 at renowned venues.
arXiv Detail & Related papers (2022-12-23T17:11:16Z)
- 3DGazeNet: Generalizing Gaze Estimation with Weak-Supervision from Synthetic Views [67.00931529296788]
We propose to train general gaze estimation models which can be directly employed in novel environments without adaptation.
We create a large-scale dataset of diverse faces with gaze pseudo-annotations, which we extract based on the 3D geometry of the scene.
We test our method on the task of gaze generalization, where we demonstrate an improvement of up to 30% over the state of the art when no ground-truth data are available.
arXiv Detail & Related papers (2022-12-06T14:15:17Z)
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces better visual grounding results than previous methods.
AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z)
- Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular.
We show that statistical problems with covariance estimation drive the poor performance of the H-score.
We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
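As a rough illustration of the kind of covariance correction described, the sketch below computes an H-score whose feature covariance is estimated with Ledoit-Wolf shrinkage; the function name and data layout are assumptions, not the paper's reference implementation.

```python
# Hedged sketch: H-score with a shrinkage covariance estimate, assuming
# `features` is an (n_samples, dim) embedding matrix and `labels` holds
# integer class ids. Illustrative only, not the paper's implementation.
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_hscore(features, labels):
    # Ledoit-Wolf shrinkage stabilizes the covariance estimate when the
    # embedding dimension is large relative to the sample size.
    cov_f = LedoitWolf().fit(features).covariance_
    # Replace each embedding with its class-conditional mean and take the
    # covariance of those means (the between-class signal).
    class_means = np.zeros(features.shape)
    for c in np.unique(labels):
        class_means[labels == c] = features[labels == c].mean(axis=0)
    cov_between = np.cov(class_means, rowvar=False)
    # H-score: trace of (feature covariance)^-1 (between-class covariance).
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_between))
```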
arXiv Detail & Related papers (2021-10-13T17:24:12Z)
- Do Fine-tuned Commonsense Language Models Really Generalize? [8.591839265985412]
We study the generalization issue in detail by designing and conducting a rigorous scientific study.
We find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup.
arXiv Detail & Related papers (2020-11-18T08:52:49Z)
- Learning Meta Face Recognition in Unseen Domains [74.69681594452125]
We propose a novel face recognition method via meta-learning, named Meta Face Recognition (MFR).
MFR synthesizes the source/target domain shift with a meta-optimization objective.
We propose two benchmarks for generalized face recognition evaluation.
arXiv Detail & Related papers (2020-03-17T14:10:30Z)