Foundational Models Defining a New Era in Vision: A Survey and Outlook
- URL: http://arxiv.org/abs/2307.13721v1
- Date: Tue, 25 Jul 2023 17:59:18 GMT
- Title: Foundational Models Defining a New Era in Vision: A Survey and Outlook
- Authors: Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer,
Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan
- Abstract summary: Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or steering a robot's behavior through language instructions.
- Score: 151.49434496615427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision systems that see and reason about the compositional nature of visual
scenes are fundamental to understanding our world. The complex relations between objects
and their locations, ambiguities, and variations in real-world environments can be better
described in human language, which is naturally governed by grammatical rules, and in other
modalities such as audio and depth. Models that learn to bridge the gap between such
modalities, coupled with large-scale training data, facilitate contextual reasoning,
generalization, and prompting capabilities at test time. These models are referred to as
foundational models. The output of such models can be modified through human-provided
prompts without retraining, e.g., segmenting a particular object by providing a bounding
box, holding interactive dialogues by asking questions about an image or video scene, or
steering a robot's behavior through language instructions. In this survey, we provide a
comprehensive review of such emerging foundational models, including typical architecture
designs for combining different modalities (vision, text, audio, etc.), training objectives
(contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common
prompting patterns: textual, visual, and heterogeneous. We discuss open challenges and
research directions for foundational models in computer vision, including difficulties in
their evaluation and benchmarking, gaps in their real-world understanding, limitations of
their contextual understanding, biases, vulnerability to adversarial attacks, and
interpretability issues. We review recent developments in this field, systematically and
comprehensively covering a wide range of applications of foundation models. A comprehensive
list of the foundational models studied in this work is available at
\url{https://github.com/awaisrauf/Awesome-CV-Foundational-Models}.
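To make the prompt-based control described in the abstract concrete, here is a minimal sketch of box-prompted segmentation in the style of Segment Anything. It assumes the segment_anything package is installed and a SAM checkpoint has been downloaded; the checkpoint path, model size, dummy image, and box coordinates are placeholders, and the sketch illustrates the general pattern rather than any specific model covered by the survey.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # assumes the package is installed

# Load a promptable segmentation model (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Stand-in for an RGB image of shape (H, W, 3); replace with a real image.
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)

# A human-provided bounding-box prompt (x0, y0, x1, y1) selects the object
# to segment at test time -- no retraining or fine-tuning is involved.
masks, scores, _ = predictor.predict(
    box=np.array([100, 80, 360, 300]),
    multimask_output=False,
)
print(masks.shape, scores)  # (1, H, W) boolean mask and its predicted quality score
```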
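The contrastive training objective mentioned in the abstract can likewise be illustrated with a short, generic sketch of a CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings; the function name, temperature value, and the assumption that embeddings come from arbitrary encoders are illustrative rather than taken from any particular model in the survey.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of N matched (image, text) pairs.

    image_emb, text_emb: (N, D) embeddings from an image and a text encoder.
    The i-th image and i-th text are positives; all other pairs are negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) cosine similarities
    targets = torch.arange(image_emb.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # match each text to its image
    return 0.5 * (loss_i2t + loss_t2i)
```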
Related papers
- VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models [2.0718016474717196]
Integrated vision-and-language models (VLMs) are frequently regarded as black boxes within the machine learning research community.
We present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments.
We then compare the internal heatmaps generated by VL models with this dataset, allowing us to analyze and better understand the model's decision-making process.
arXiv Detail & Related papers (2024-10-06T20:11:53Z)
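To illustrate the kind of comparison such a dataset enables, here is a hypothetical sketch (not the authors' code) that scores agreement between a model-generated heatmap and a human attention map with a rank correlation; the function, array layout, and choice of metric are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def heatmap_agreement(model_heatmap: np.ndarray, human_heatmap: np.ndarray) -> float:
    """Spearman rank correlation between a model's internal heatmap and a
    human visual-attention map over the same image (both of shape H x W)."""
    assert model_heatmap.shape == human_heatmap.shape, "maps must share the same grid"
    rho, _ = spearmanr(model_heatmap.ravel(), human_heatmap.ravel())
    return float(rho)
```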
- ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers [1.6541870997607049]
We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers.
ARPA's introduction marks a significant milestone in visual word disambiguation, offering a compelling solution.
We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive unprecedented advancements in artificial intelligence.
arXiv Detail & Related papers (2024-08-12T10:15:13Z)
- One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
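As a rough illustration of aligning visual features with affordance text embeddings (a generic sketch under assumed shapes, not the OOAL implementation), the following scores each image patch against each affordance class by cosine similarity:

```python
import torch
import torch.nn.functional as F

def affordance_maps(patch_features: torch.Tensor,
                    affordance_text_embeds: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    """Dense affordance predictions from vision-language alignment.

    patch_features:         (P, D) per-patch features from a vision encoder
    affordance_text_embeds: (K, D) text embeddings for K affordance classes
    returns:                (K, P) per-patch distribution over affordances
    """
    v = F.normalize(patch_features, dim=-1)
    t = F.normalize(affordance_text_embeds, dim=-1)
    sim = t @ v.t() / temperature   # (K, P) cosine similarities, temperature-scaled
    return sim.softmax(dim=0)       # softmax over affordance classes for each patch
```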
- Feature Interactions Reveal Linguistic Structure in Language Models [2.0178765779788495]
We study feature interactions in the context of feature attribution methods for post-hoc interpretability.
We work out a grey box methodology, in which we train models to perfection on a formal language classification task.
We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model.
arXiv Detail & Related papers (2023-06-21T11:24:41Z)
- Foundation Models for Decision Making: Problems, Methods, and Opportunities [124.79381732197649]
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks.
New paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning.
Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems.
arXiv Detail & Related papers (2023-03-07T18:44:07Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are key challenges for existing visual models.
Inspired by the strong inference ability of human-level agents, researchers have devoted great effort in recent years to developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, and highlight the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
- Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques [15.906959137350247]
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years.
We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work.
arXiv Detail & Related papers (2021-04-27T14:32:22Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
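A toy version of such a probe (a hypothetical sketch, not the VALUE code) can estimate how much attention mass a single-stream V+L transformer places on text versus image tokens; the token layout and tensor shapes are assumptions:

```python
import torch

def modality_attention_share(attn: torch.Tensor, num_text_tokens: int):
    """Average fraction of attention mass directed at text vs. image tokens.

    attn: (layers, heads, seq_len, seq_len) attention weights from a
          single-stream vision-and-language transformer, assuming text
          tokens come first and image-region tokens follow.
    """
    text_mass = attn[..., :num_text_tokens].sum(dim=-1)    # mass each query sends to text tokens
    image_mass = attn[..., num_text_tokens:].sum(dim=-1)   # mass each query sends to image tokens
    total = text_mass + image_mass
    return (text_mass / total).mean().item(), (image_mass / total).mean().item()
```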
This list is automatically generated from the titles and abstracts of the papers on this site.