GRIT: General Robust Image Task Benchmark
- URL: http://arxiv.org/abs/2204.13653v1
- Date: Thu, 28 Apr 2022 17:13:23 GMT
- Title: GRIT: General Robust Image Task Benchmark
- Authors: Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, Derek Hoiem
- Abstract summary: We introduce the General Robust Image Task (GRIT) benchmark.
GRIT evaluates the performance, robustness, and calibration of a vision system across a variety of image prediction tasks, concepts, and data sources.
By providing a unified platform for thorough assessment of skills and concepts learned by a vision model, we hope GRIT catalyzes the development of performant and robust general purpose vision systems.
- Score: 32.556726698322755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer vision models excel at making predictions when the test distribution
closely resembles the training distribution. Such models have yet to match the
ability of biological vision to learn from multiple sources and generalize to
new data sources and tasks. To facilitate the development and evaluation of
more general vision systems, we introduce the General Robust Image Task (GRIT)
benchmark. GRIT evaluates the performance, robustness, and calibration of a
vision system across a variety of image prediction tasks, concepts, and data
sources. The seven tasks in GRIT are selected to cover a range of visual
skills: object categorization, object localization, referring expression
grounding, visual question answering, segmentation, human keypoint detection,
and surface normal estimation. GRIT is carefully designed to enable the
evaluation of robustness under image perturbations, image source distribution
shift, and concept distribution shift. By providing a unified platform for
thorough assessment of skills and concepts learned by a vision model, we hope
GRIT catalyzes the development of performant and robust general purpose vision
systems.
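The abstract names three evaluation axes for each of the seven tasks: raw performance, robustness (under image perturbations as well as image source and concept distribution shift), and calibration. As a rough illustration only (this is not the official GRIT evaluation code, and every name below, including TaskResult, robustness_gap, and expected_calibration_error, is hypothetical), the Python sketch below shows how per-task performance, a clean-versus-perturbed robustness gap, and a simple calibration error could be aggregated across the seven tasks:

```python
# Illustrative sketch only (not the GRIT toolkit): aggregate per-task accuracy,
# a robustness gap under image perturbations, and a simple calibration error,
# in the spirit of the three axes described in the abstract.
from dataclasses import dataclass
from typing import List

GRIT_TASKS = [
    "categorization", "localization", "referring_expression_grounding",
    "vqa", "segmentation", "keypoint_detection", "surface_normal_estimation",
]

@dataclass
class TaskResult:
    task: str
    clean_scores: List[float]      # per-sample scores on unperturbed images
    perturbed_scores: List[float]  # same samples under image perturbations
    confidences: List[float]       # model-reported confidence per sample
    correct: List[bool]            # whether each prediction counted as correct

def robustness_gap(r: TaskResult) -> float:
    """Drop in mean score when moving from clean to perturbed images."""
    clean = sum(r.clean_scores) / len(r.clean_scores)
    perturbed = sum(r.perturbed_scores) / len(r.perturbed_scores)
    return clean - perturbed

def expected_calibration_error(r: TaskResult, bins: int = 10) -> float:
    """Simple ECE: |accuracy - mean confidence| averaged over confidence bins."""
    total, ece = len(r.confidences), 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(r.confidences)
               if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(r.correct[i] for i in idx) / len(idx)
        conf = sum(r.confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(acc - conf)
    return ece

def summarize(results: List[TaskResult]) -> None:
    for r in results:
        clean = sum(r.clean_scores) / len(r.clean_scores)
        print(f"{r.task:35s} clean={clean:.3f} "
              f"robustness_gap={robustness_gap(r):.3f} "
              f"ECE={expected_calibration_error(r):.3f}")
```

This sketch captures only the perturbation axis of robustness; GRIT additionally evaluates shifts in image source and concept distribution, which would require comparing scores across held-out data sources and concepts rather than perturbed copies of the same images.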
Related papers
- Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees [50.78679002846741]
We introduce a novel approach for learning cross-task generalities in graphs.
We propose task-trees as basic learning instances to align task spaces on graphs.
Our findings indicate that when a graph neural network is pretrained on diverse task-trees, it acquires transferable knowledge.
arXiv Detail & Related papers (2024-12-21T02:07:43Z)
- VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models [1.597617022056624]
Large Vision-Language Models (LVLMs) are increasingly capable of tackling abstract visual tasks.
We introduce VisGraphVar, a customizable benchmark generator able to produce graph images for seven task categories.
We show that variations in visual attributes of images (e.g., node labeling and layout) and the deliberate inclusion of visual imperfections significantly affect model performance.
arXiv Detail & Related papers (2024-11-22T10:10:53Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning [27.304282924423095]
We propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G).
PIE-G is a simple yet effective framework that can generalize to the unseen visual scenarios in a zero-shot manner.
Empirical evidence suggests PIE-G improves sample efficiency and significantly outperforms previous state-of-the-art methods in terms of generalization performance.
arXiv Detail & Related papers (2022-12-17T12:45:08Z)
- Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation [26.039045505150526]
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
We present a new comprehensive benchmark, General Visual Understanding Evaluation, covering the full spectrum of visual cognitive abilities.
arXiv Detail & Related papers (2022-11-28T15:06:07Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore how to model peripheral vision in deep neural networks for visual recognition.
We propose incorporating peripheral position encoding into the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data.
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
- Fairness Indicators for Systematic Assessments of Visual Feature Extractors [21.141633753573764]
We propose three fairness indicators, which aim at quantifying harms and biases of visual systems.
Our indicators use existing publicly available datasets collected for fairness evaluations.
These indicators are not intended to be a substitute for a thorough analysis of the broader impact of the new computer vision technologies.
arXiv Detail & Related papers (2022-02-15T17:45:33Z)
- Generative Hierarchical Features from Synthesizing Images [65.66756821069124]
We show that learning to synthesize images can bring remarkable hierarchical visual features that are generalizable across a wide range of applications.
The visual feature produced by our encoder, termed Generative Hierarchical Feature (GH-Feat), has strong transferability to both generative and discriminative tasks.
arXiv Detail & Related papers (2020-07-20T18:04:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.