Foundational Models Defining a New Era in Vision: A Survey and Outlook
- URL: http://arxiv.org/abs/2307.13721v1
- Date: Tue, 25 Jul 2023 17:59:18 GMT
- Title: Foundational Models Defining a New Era in Vision: A Survey and Outlook
- Authors: Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer,
Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan
- Abstract summary: Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
- Score: 151.49434496615427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision systems that see and reason about the compositional nature of visual
scenes are fundamental to understanding our world. The complex relations
between objects and their locations, ambiguities, and variations in the
real-world environment can be better described in human language, naturally
governed by grammatical rules and other modalities such as audio and depth. The
models learned to bridge the gap between such modalities coupled with
large-scale training data facilitate contextual reasoning, generalization, and
prompt capabilities at test time. These models are referred to as foundational
models. The output of such models can be modified through human-provided
prompts without retraining, e.g., segmenting a particular object by providing a
bounding box, having interactive dialogues by asking questions about an image
or video scene, or manipulating a robot's behavior through language
instructions. In this survey, we provide a comprehensive review of such
emerging foundational models, including typical architecture designs to combine
different modalities (vision, text, audio, etc.), training objectives
(contrastive, generative), pre-training datasets, fine-tuning mechanisms, and
the common prompting patterns: textual, visual, and heterogeneous. We discuss
the open challenges and research directions for foundational models in computer
vision, including difficulties in their evaluations and benchmarking, gaps in
their real-world understanding, limitations of their contextual understanding,
biases, vulnerability to adversarial attacks, and interpretability issues. We
review recent developments in this field, covering a wide range of applications
of foundation models systematically and comprehensively. A comprehensive list
of foundational models studied in this work is available at
https://github.com/awaisrauf/Awesome-CV-Foundational-Models.
Related papers
- One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
- Transcending the Attention Paradigm: Representation Learning from Geospatial Social Media Data [1.8311821879979955]
This study challenges the paradigm of performance benchmarking by investigating social media data as a source of distributed patterns.
To properly represent these abstract relationships, this research dissects empirical social media corpora into their elemental components, analyzing over two billion tweets across population-dense locations.
arXiv Detail & Related papers (2023-10-09T03:27:05Z)
- Feature Interactions Reveal Linguistic Structure in Language Models [2.0178765779788495]
We study feature interactions in the context of feature attribution methods for post-hoc interpretability.
We work out a grey box methodology, in which we train models to perfection on a formal language classification task.
We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model.
arXiv Detail & Related papers (2023-06-21T11:24:41Z)
- Foundation Models for Decision Making: Problems, Methods, and Opportunities [124.79381732197649]
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks.
New paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning.
Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems.
arXiv Detail & Related papers (2023-03-07T18:44:07Z)
- An Overview on Controllable Text Generation via Variational Auto-Encoders [15.97186478109836]
Recent advances in neural-based generative modeling have reignited the hopes of having computer systems capable of conversing with humans.
Latent variable models (LVM) such as variational auto-encoders (VAEs) are designed to characterize the distributional pattern of textual data.
This overview introduces existing generation schemes, discusses problems associated with text variational auto-encoders, and reviews several applications of controllable generation.
arXiv Detail & Related papers (2022-11-15T07:36:11Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
- Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques [15.906959137350247]
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years.
We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work.
arXiv Detail & Related papers (2021-04-27T14:32:22Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.