GRIT: General Robust Image Task Benchmark
- URL: http://arxiv.org/abs/2204.13653v1
- Date: Thu, 28 Apr 2022 17:13:23 GMT
- Title: GRIT: General Robust Image Task Benchmark
- Authors: Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, Derek Hoiem
- Abstract summary: We introduce the General Robust Image Task (GRIT) benchmark.
GRIT evaluates the performance, robustness, and calibration of a vision system across a variety of image prediction tasks, concepts, and data sources.
By providing a unified platform for thorough assessment of skills and concepts learned by a vision model, we hope GRIT catalyzes the development of performant and robust general purpose vision systems.
- Score: 32.556726698322755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer vision models excel at making predictions when the test distribution
closely resembles the training distribution. Such models have yet to match the
ability of biological vision to learn from multiple sources and generalize to
new data sources and tasks. To facilitate the development and evaluation of
more general vision systems, we introduce the General Robust Image Task (GRIT)
benchmark. GRIT evaluates the performance, robustness, and calibration of a
vision system across a variety of image prediction tasks, concepts, and data
sources. The seven tasks in GRIT are selected to cover a range of visual
skills: object categorization, object localization, referring expression
grounding, visual question answering, segmentation, human keypoint detection,
and surface normal estimation. GRIT is carefully designed to enable the
evaluation of robustness under image perturbations, image source distribution
shift, and concept distribution shift. By providing a unified platform for
thorough assessment of skills and concepts learned by a vision model, we hope
GRIT catalyzes the development of performant and robust general purpose vision
systems.
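The abstract names three evaluation axes for each of the seven tasks: raw performance, robustness (under image perturbations as well as image source and concept distribution shift), and calibration. As a rough illustration only (this is not the official GRIT evaluation code, and every name below, including TaskResult, robustness_gap, and expected_calibration_error, is hypothetical), the Python sketch below shows how per-task performance, a clean-versus-perturbed robustness gap, and a simple calibration error could be aggregated across the seven tasks:

```python
# Illustrative sketch only (not the GRIT toolkit): aggregate per-task accuracy,
# a robustness gap under image perturbations, and a simple calibration error,
# in the spirit of the three axes described in the abstract.
from dataclasses import dataclass
from typing import List

GRIT_TASKS = [
    "categorization", "localization", "referring_expression_grounding",
    "vqa", "segmentation", "keypoint_detection", "surface_normal_estimation",
]

@dataclass
class TaskResult:
    task: str
    clean_scores: List[float]      # per-sample scores on unperturbed images
    perturbed_scores: List[float]  # same samples under image perturbations
    confidences: List[float]       # model-reported confidence per sample
    correct: List[bool]            # whether each prediction counted as correct

def robustness_gap(r: TaskResult) -> float:
    """Drop in mean score when moving from clean to perturbed images."""
    clean = sum(r.clean_scores) / len(r.clean_scores)
    perturbed = sum(r.perturbed_scores) / len(r.perturbed_scores)
    return clean - perturbed

def expected_calibration_error(r: TaskResult, bins: int = 10) -> float:
    """Simple ECE: |accuracy - mean confidence| averaged over confidence bins."""
    total, ece = len(r.confidences), 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(r.confidences)
               if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(r.correct[i] for i in idx) / len(idx)
        conf = sum(r.confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(acc - conf)
    return ece

def summarize(results: List[TaskResult]) -> None:
    for r in results:
        clean = sum(r.clean_scores) / len(r.clean_scores)
        print(f"{r.task:35s} clean={clean:.3f} "
              f"robustness_gap={robustness_gap(r):.3f} "
              f"ECE={expected_calibration_error(r):.3f}")
```

This sketch captures only the perturbation axis of robustness; GRIT additionally evaluates shifts in image source and concept distribution, which would require comparing scores across held-out data sources and concepts rather than perturbed copies of the same images.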
Related papers
- Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees [50.78679002846741]
We introduce a novel approach for learning cross-task generalities in graphs.
We propose task-trees as basic learning instances to align task spaces on graphs.
Our findings indicate that when a graph neural network is pretrained on diverse task-trees, it acquires transferable knowledge.
arXiv Detail & Related papers (2024-12-21T02:07:43Z)
- VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models [1.597617022056624]
Large Vision-Language Models (LVLMs) are increasingly capable of tackling abstract visual tasks.
We introduce VisGraphVar, a customizable benchmark generator able to produce graph images for seven task categories.
We show that variations in visual attributes of images (e.g., node labeling and layout) and the deliberate inclusion of visual imperfections significantly affect model performance.
arXiv Detail & Related papers (2024-11-22T10:10:53Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning [27.304282924423095]
We propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G).
PIE-G is a simple yet effective framework that can generalize to the unseen visual scenarios in a zero-shot manner.
Empirical evidence suggests PIE-G improves sample efficiency and significantly outperforms previous state-of-the-art methods in terms of generalization performance.
arXiv Detail & Related papers (2022-12-17T12:45:08Z)
- Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation [26.039045505150526]
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
We present a new comprehensive benchmark, General Visual Understanding Evaluation, covering the full spectrum of visual cognitive abilities.
arXiv Detail & Related papers (2022-11-28T15:06:07Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore how to model peripheral vision in deep neural networks for visual recognition.
We propose incorporating peripheral position encoding into the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
- Fairness Indicators for Systematic Assessments of Visual Feature Extractors [21.141633753573764]
We propose three fairness indicators, which aim at quantifying harms and biases of visual systems.
Our indicators use existing publicly available datasets collected for fairness evaluations.
These indicators are not intended to be a substitute for a thorough analysis of the broader impact of the new computer vision technologies.
arXiv Detail & Related papers (2022-02-15T17:45:33Z)
- Generative Hierarchical Features from Synthesizing Images [65.66756821069124]
We show that learning to synthesize images can bring remarkable hierarchical visual features that are generalizable across a wide range of applications.
The visual feature produced by our encoder, termed Generative Hierarchical Feature (GH-Feat), has strong transferability to both generative and discriminative tasks.
arXiv Detail & Related papers (2020-07-20T18:04:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.