Learning Scene Context Without Images
- URL: http://arxiv.org/abs/2311.10998v1
- Date: Sat, 18 Nov 2023 07:27:25 GMT
- Title: Learning Scene Context Without Images
- Authors: Amirreza Rouhi, David Han
- Abstract summary: We introduce a novel approach to teach scene contextual knowledge to machines using an attention mechanism.
A distinctive aspect of the proposed approach is its reliance solely on labels from image datasets to teach scene context.
We show how scene-wide relationships among different objects can be learned using a self-attention mechanism.
- Score: 2.8184014933789365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teaching machines scene contextual knowledge would enable them to interact
more effectively with the environment and to anticipate or predict objects that
may not be immediately apparent in their perceptual field. In this paper, we
introduce a novel transformer-based approach called $LMOD$ (Label-based
Missing Object Detection) to teach scene contextual knowledge to machines using
an attention mechanism. A distinctive aspect of the proposed approach is its
reliance solely on labels from image datasets to teach scene context, entirely
eliminating the need for the actual image itself. We show how scene-wide
relationships among different objects can be learned using a self-attention
mechanism. We further show that the contextual knowledge gained from label-based
learning can enhance the performance of other vision-based object detection
algorithms.
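The paper itself includes no code, but the idea of learning scene context from labels alone lends itself to a short sketch. The following is a minimal, hypothetical PyTorch example (not the authors' implementation) of a self-attention model over object-label sets: the labels co-occurring in an image are embedded, one label is hidden behind a mask token, and a transformer encoder is trained to predict the missing object from the remaining ones. The module name `LabelContextModel`, the dimensions, and the masking scheme are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): self-attention over object labels
# to predict a masked/missing object, loosely following the label-only
# scene-context idea described in the abstract. No image input is used.
import torch
import torch.nn as nn

class LabelContextModel(nn.Module):  # hypothetical name
    def __init__(self, num_classes: int, d_model: int = 128, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # +1 reserves an extra index for the [MASK] token standing in for the missing object
        self.embed = nn.Embedding(num_classes + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)  # scores each class as the missing object

    def forward(self, label_ids: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # label_ids: (batch, seq) integer class ids, one position set to the mask id
        # pad_mask:  (batch, seq) bool, True where the position is padding
        x = self.encoder(self.embed(label_ids), src_key_padding_mask=pad_mask)
        # mean-pool over non-padded positions, then classify the missing object
        keep = (~pad_mask).unsqueeze(-1)
        x = (x * keep).sum(1) / keep.sum(1)
        return self.head(x)

# Usage sketch: for each image's label set, hide one label and minimize
# cross-entropy between the prediction and the hidden label.
model = LabelContextModel(num_classes=80)                      # e.g. 80 COCO categories
logits = model(torch.tensor([[80, 3, 17, 0]]),                 # 80 = mask id, 0 here is padding filler
               torch.tensor([[False, False, False, True]]))    # last slot is padding
loss = nn.functional.cross_entropy(logits, torch.tensor([56])) # 56 = the hidden class
```

A model trained this way could, in principle, be used to re-rank or boost a visual detector's low-confidence predictions when the predicted class is strongly supported by the other objects in the scene, which is the enhancement the abstract alludes to.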
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z) - VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z) - Context-driven Visual Object Recognition based on Knowledge Graphs [0.8701566919381223]
We propose an approach that enhances deep learning methods by using external contextual knowledge encoded in a knowledge graph.
We conduct a series of experiments to investigate the impact of different contextual views on the learned object representations for the same image dataset.
arXiv Detail & Related papers (2022-10-20T13:09:00Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Scene Recognition with Objectness, Attribute and Category Learning [8.581276116041401]
Scene classification has established itself as a challenging research problem.
Image recognition serves as a key pillar for the good performance of scene recognition.
We propose a Multi-task Attribute-Scene Recognition network which learns a category embedding and at the same time predicts scene attributes.
arXiv Detail & Related papers (2022-07-20T19:51:54Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z) - COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)