A model for full local image interpretation
- URL: http://arxiv.org/abs/2110.08744v1
- Date: Sun, 17 Oct 2021 07:20:53 GMT
- Title: A model for full local image interpretation
- Authors: Guy Ben-Yosef, Liav Assif, Daniel Harari, Shimon Ullman
- Abstract summary: We describe a computational model of humans' ability to provide a detailed interpretation of components in a scene.
Detailed interpretation is beyond the scope of current models of visual recognition; our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward processing with only limited top-down processing.
We discuss implications of the model for visual interpretation by humans and by computer vision models.
- Score: 8.048166434189522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe a computational model of humans' ability to provide a detailed
interpretation of components in a scene. Humans can identify meaningful
components almost everywhere in an image, and identifying these components is an
essential part of the visual process, and of understanding the surrounding
scene and its potential meaning to the viewer. Detailed interpretation is
beyond the scope of current models of visual recognition. Our model suggests
that this is a fundamental limitation, related to the fact that existing models
rely on feed-forward processing with only limited top-down processing. In our
model, a first recognition stage leads to the initial activation of class
candidates, which is incomplete and of limited accuracy. This stage then
triggers the application of class-specific interpretation and validation
processes, which recover a richer and more accurate interpretation of the
visible scene. We discuss implications of the model for visual interpretation
by humans and by computer vision models.
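To make the two-stage scheme concrete, below is a minimal Python sketch of the control flow the abstract describes: a feed-forward pass proposes class candidates, and each sufficiently strong candidate triggers a class-specific top-down interpretation and validation process. All names here (`bottom_up_classifier`, `INTERPRETERS`, `interpret_scene`) and the candidate threshold are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage interpretation scheme: stage 1 is a
# coarse feed-forward recognizer; stage 2 applies class-specific top-down
# processes that enrich and validate each candidate.
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    label: str                                  # class hypothesis
    score: float                                # confidence after validation
    parts: dict = field(default_factory=dict)   # recovered internal components

def bottom_up_classifier(image):
    """Stage 1: feed-forward pass returning coarse (label, score) candidates.

    Stand-in for any off-the-shelf recognition model; its output may be
    incomplete and of limited accuracy, as the abstract notes.
    """
    raise NotImplementedError("plug in a recognition model here")

# Stage 2 processes, indexed by class label. Each interpreter encodes the
# internal structure of its class (parts and their relations) and returns
# (parts, validated_score) for a candidate.
INTERPRETERS = {}  # e.g. {"horse": interpret_horse} -- hypothetical entries

def interpret_scene(image, candidate_threshold=0.2):
    """Run the bottom-up stage, then trigger top-down interpretation
    for each class candidate above the (assumed) threshold."""
    results = []
    for label, score in bottom_up_classifier(image):
        if score < candidate_threshold:
            continue
        interpreter = INTERPRETERS.get(label)
        if interpreter is None:
            # No class-specific knowledge: keep the coarse hypothesis.
            results.append(Interpretation(label, score))
            continue
        parts, validated_score = interpreter(image, label, score)
        results.append(Interpretation(label, validated_score, parts))
    return sorted(results, key=lambda r: r.score, reverse=True)
```

The design point mirrored here is that top-down processing is class-specific and is triggered by, rather than run in parallel with, the initial feed-forward recognition.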
Related papers
- One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that can see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- Understanding Self-Supervised Pretraining with Part-Aware Representation Learning [88.45460880824376]
We study whether self-supervised representation pretraining methods learn part-aware representations.
Results show that the fully-supervised model outperforms self-supervised models for object-level recognition.
arXiv Detail & Related papers (2023-01-27T18:58:42Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks than vision-only models.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions [3.7957452405531256]
This paper explores the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level.
We show that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene.
We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
arXiv Detail & Related papers (2022-11-09T15:33:51Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations [24.327862278556445]
We propose a Neuro-Symbolic scene representation, which allows one to revise the model on the semantic level.
The results of our experiments on CLEVR-Hans demonstrate that our semantic explanations can identify confounders.
arXiv Detail & Related papers (2020-11-25T16:23:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.