Unifying (Machine) Vision via Counterfactual World Modeling
- URL: http://arxiv.org/abs/2306.01828v1
- Date: Fri, 2 Jun 2023 17:45:44 GMT
- Title: Unifying (Machine) Vision via Counterfactual World Modeling
- Authors: Daniel M. Bear, Kevin Feigelis, Honglin Chen, Wanhee Lee, Rahul
Venkatesh, Klemen Kotar, Alex Durango, Daniel L.K. Yamins
- Abstract summary: We introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model.
CWM has two key components, which resolve the core issues that have hindered application of the foundation model concept to vision.
We show that CWM generates high-quality readouts on real-world images and videos for a diversity of tasks.
- Score: 5.001446411351483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leading approaches in machine vision employ different architectures for
different tasks, trained on costly task-specific labeled datasets. This
complexity has held back progress in areas such as robotics, where robust
task-general perception remains a bottleneck. In contrast, "foundation models"
of natural language have shown how large pre-trained neural networks can
provide zero-shot solutions to a broad spectrum of apparently distinct tasks.
Here we introduce Counterfactual World Modeling (CWM), a framework for
constructing a visual foundation model: a unified, unsupervised network that
can be prompted to perform a wide variety of visual computations. CWM has two
key components, which resolve the core issues that have hindered application of
the foundation model concept to vision. The first is structured masking, a
generalization of masked prediction methods that encourages a prediction model
to capture the low-dimensional structure in visual data. The model thereby
factors the key physical components of a scene and exposes an interface to them
via small sets of visual tokens. This in turn enables CWM's second main idea --
counterfactual prompting -- the observation that many apparently distinct
visual representations can be computed, in a zero-shot manner, by comparing the
prediction model's output on real inputs versus slightly modified
("counterfactual") inputs. We show that CWM generates high-quality readouts on
real-world images and videos for a diversity of tasks, including estimation of
keypoints, optical flow, occlusions, object segments, and relative depth. Taken
together, our results show that CWM is a promising path to unifying the
manifold strands of machine vision in a conceptually simple foundation.
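The two ideas above can be made concrete with a small sketch. The Python snippet below is an illustrative toy under stated assumptions, not the paper's implementation: masked_predictor is a hypothetical stand-in for the pretrained prediction model (here a trivial copy-based stub so the example runs), the 8x8 arrays play the role of frames, and the sparse visible_mask plays the role of structured masking. What it does show is the counterfactual-prompting pattern itself: run the predictor on the real sparse inputs, run it again after slightly modifying one visible token, and difference the two outputs to read out what that token controls.

```python
import numpy as np


# Hypothetical stand-in for a pretrained CWM-style masked predictor: the real
# model is a transformer that sees frame t in full plus a small, structured set
# of frame t+1 tokens and predicts the rest of frame t+1. This stub just copies
# frame t and overwrites the visible tokens, so the sketch runs end to end.
def masked_predictor(frame_t: np.ndarray,
                     frame_t1: np.ndarray,
                     visible_mask: np.ndarray) -> np.ndarray:
    pred = frame_t.copy()                        # default guess: nothing moved
    pred[visible_mask] = frame_t1[visible_mask]  # trust the few visible tokens
    return pred


def counterfactual_readout(frame_t, frame_t1, visible_mask, patch_yx, delta=1.0):
    """Difference the predictor's output on real vs. perturbed inputs.

    Nudging a single visible frame-t+1 token and observing where the
    prediction changes is the zero-shot "counterfactual prompt": the
    difference map localizes what that token controls in the scene.
    """
    base = masked_predictor(frame_t, frame_t1, visible_mask)

    frame_t1_cf = frame_t1.copy()     # counterfactual copy of the input
    y, x = patch_yx
    frame_t1_cf[y, x] += delta        # slightly modify one visible token
    cf = masked_predictor(frame_t, frame_t1_cf, visible_mask)

    return np.abs(cf - base)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_t = rng.random((8, 8)).astype(np.float32)
    frame_t1 = np.roll(frame_t, shift=1, axis=1)   # toy "motion": shift right

    # Structured masking: only a tiny set of frame t+1 tokens is left visible.
    visible_mask = np.zeros((8, 8), dtype=bool)
    visible_mask[3, 3] = True
    visible_mask[5, 6] = True

    effect = counterfactual_readout(frame_t, frame_t1, visible_mask, patch_yx=(3, 3))
    print(effect.round(2))
```

In the actual framework, the same compare-real-versus-counterfactual recipe, applied to a strong prediction model, is what yields the keypoint, flow, occlusion, segmentation, and depth readouts listed in the abstract.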
Related papers
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also to lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)