Perceive, Ground, Reason, and Act: A Benchmark for General-purpose
Visual Representation
- URL: http://arxiv.org/abs/2211.15402v1
- Date: Mon, 28 Nov 2022 15:06:07 GMT
- Title: Perceive, Ground, Reason, and Act: A Benchmark for General-purpose
Visual Representation
- Authors: Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia, Zan Wang, Xiaojian
Ma, Qing Li, Siyuan Huang
- Abstract summary: Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities.
- Score: 26.039045505150526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current computer vision models, unlike the human visual system, cannot yet
achieve general-purpose visual understanding. Existing efforts to create a
general vision model are limited in the scope of assessed tasks and offer no
overarching framework to perform them holistically. We present a new
comprehensive benchmark, General-purpose Visual Understanding Evaluation
(G-VUE), covering the full spectrum of visual cognitive abilities with four
functional domains: Perceive, Ground, Reason, and Act. The
four domains are embodied in 11 carefully curated tasks, from 3D reconstruction
to visual reasoning and manipulation. Along with the benchmark, we provide a
general encoder-decoder framework to allow for the evaluation of arbitrary
visual representations on all 11 tasks. We evaluate various pre-trained visual
representations with our framework and observe that (1) Transformer-based
visual backbones generally outperform CNN-based backbones on G-VUE, and (2)
visual representations from vision-language pre-training are superior to those
from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic
evaluation standard to motivate research toward building general-purpose visual
systems via obtaining more general-purpose visual representations.
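The abstract's encoder-decoder framework pairs one pre-trained visual backbone with lightweight decoders for all 11 tasks, so any representation can be scored uniformly across the four domains. A minimal sketch of that kind of evaluation harness is shown below, assuming PyTorch; the class, task names, and dimensions are illustrative and not taken from the released G-VUE code.
```python
# Minimal, hypothetical sketch of an encoder-decoder evaluation harness in the
# spirit of G-VUE: one pre-trained visual backbone shared across per-task
# decoder heads. Names and dimensions are illustrative, not the released code.
import torch
import torch.nn as nn


class RepresentationEvaluator(nn.Module):
    def __init__(self, backbone: nn.Module, task_heads: dict):
        super().__init__()
        self.backbone = backbone                      # visual representation under test
        self.task_heads = nn.ModuleDict(task_heads)   # one lightweight decoder per task
        for p in self.backbone.parameters():          # freeze so scores reflect the
            p.requires_grad = False                   # representation, not fine-tuning

    def forward(self, images: torch.Tensor, task: str) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(images)             # shared features, shape (B, D)
        return self.task_heads[task](feats)           # task-specific prediction


# Toy usage with a stand-in backbone and two hypothetical task heads.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
heads = {
    "depth": nn.Linear(512, 224 * 224),   # Perceive-style dense prediction (flattened)
    "vqa": nn.Linear(512, 4),             # Reason-style multiple-choice answer logits
}
evaluator = RepresentationEvaluator(backbone, heads)
logits = evaluator(torch.randn(2, 3, 224, 224), task="vqa")
print(logits.shape)  # torch.Size([2, 4])
```
Freezing the backbone in such a harness keeps per-task scores tied to the quality of the representation rather than to task-specific fine-tuning capacity.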
Related papers
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) through specialized agent collaboration and tool use.
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making [19.09644604789813]
We develop Autonomous Visualization Agents (AVAs) that can interpret and accomplish user-defined visualization objectives through natural language.
The addition of visual perception allows AVAs to act as virtual visualization assistants for domain experts who may lack the knowledge or expertise to fine-tune visualization outputs.
Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high-level visualization goals.
arXiv Detail & Related papers (2023-12-07T18:13:42Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Does Visual Pretraining Help End-to-End Reasoning? [81.4707017038019]
We investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks.
We propose a simple and general self-supervised framework which "compresses" each video frame into a small set of tokens.
We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning.
arXiv Detail & Related papers (2023-07-17T14:08:38Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition.
We propose to incorporate peripheral position encoding into the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data (a generic sketch of this distance-biased attention idea appears after this list).
We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
- GAMR: A Guided Attention Model for (visual) Reasoning [7.919213739992465]
Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes.
We present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR).
GAMR posits that the brain solves complex visual reasoning problems dynamically via sequences of attention shifts to select and route task-relevant visual information into memory.
arXiv Detail & Related papers (2022-06-10T07:52:06Z)
- GRIT: General Robust Image Task Benchmark [32.556726698322755]
We introduce the General Robust Image Task (GRIT) benchmark.
GRIT evaluates the performance, robustness, and calibration of a vision system across a variety of image prediction tasks, concepts, and data sources.
By providing a unified platform for thorough assessment of skills and concepts learned by a vision model, we hope GRIT catalyzes the development of performant and robust general-purpose vision systems.
arXiv Detail & Related papers (2022-04-28T17:13:23Z)
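The Peripheral Vision Transformer entry above describes adding peripheral position encoding to the multi-head self-attention layers. As a generic illustration of that idea, the sketch below adds a learned, distance-dependent bias to the attention logits; the module name and the bias parameterization are assumptions and do not reproduce PerViT's actual formulation.
```python
# Generic sketch: self-attention with a learned, distance-dependent bias added
# to the attention logits, illustrating position-aware attention over an image
# grid. This is NOT PerViT's actual peripheral position encoding; names and the
# bias parameterization are assumptions for illustration only.
import torch
import torch.nn as nn


class DistanceBiasedSelfAttention(nn.Module):
    def __init__(self, dim: int, grid_size: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Small MLP mapping query-key spatial distance to a per-head bias.
        self.bias_mlp = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, num_heads))
        # Precompute pairwise Euclidean distances between grid positions.
        ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()        # (N, 2)
        self.register_buffer("dist", torch.cdist(coords, coords).unsqueeze(-1))   # (N, N, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:                  # x: (B, N, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)  # (B, H, N, d)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5           # (B, H, N, N)
        bias = self.bias_mlp(self.dist).permute(2, 0, 1)                  # (H, N, N)
        attn = (attn + bias).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


# Toy usage: a 14x14 grid of 64-dimensional tokens.
layer = DistanceBiasedSelfAttention(dim=64, grid_size=14)
tokens = torch.randn(2, 14 * 14, 64)
print(layer(tokens).shape)  # torch.Size([2, 196, 64])
```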