Learning Structured Representations of Visual Scenes
- URL: http://arxiv.org/abs/2207.04200v1
- Date: Sat, 9 Jul 2022 05:40:08 GMT
- Title: Learning Structured Representations of Visual Scenes
- Authors: Meng-Jiun Chiou
- Abstract summary: We study how machines can describe the content of an individual image or video with visual relationships as structured representations.
Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both static-image and video settings.
- Score: 1.6244541005112747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As intermediate-level representations bridging low-level visual
recognition and high-level reasoning, structured representations of visual
scenes, such as visual relationships between pairwise objects, have been shown
not only to benefit compositional models in learning to reason along with the
structures but also to provide higher interpretability for model decisions.
Nevertheless, these representations receive much less attention than
traditional recognition tasks, leaving numerous open challenges unsolved. In
this thesis, we study how machines can describe the content of an individual
image or video with visual relationships as structured representations.
Specifically, we explore how structured representations of visual scenes can be
effectively constructed and learned in both static-image and video settings,
with improvements resulting from external knowledge incorporation,
bias-reducing mechanisms, and enhanced representation models. At the end of the
thesis, we also discuss open challenges and limitations to shed light on future
directions of structured representation learning for visual scenes.
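The structured representations the thesis studies, visual relationships between pairwise objects, are commonly encoded as (subject, predicate, object) triples over detected objects. A minimal sketch of that data structure is below; the class names, labels, and bounding boxes are illustrative assumptions, not the thesis's actual implementation.

```python
# A minimal sketch of a scene graph: visual relationships encoded as
# (subject, predicate, object) triples over detected objects.
# All labels and boxes here are hypothetical examples.

from dataclasses import dataclass

@dataclass(frozen=True)
class DetectedObject:
    label: str                 # object category, e.g. "person"
    box: tuple                 # (x1, y1, x2, y2) bounding box in pixels

@dataclass(frozen=True)
class Relationship:
    subject: DetectedObject
    predicate: str             # e.g. "riding", "next to"
    obj: DetectedObject

def to_triples(relationships):
    """Flatten a scene graph into (subject, predicate, object) label triples."""
    return [(r.subject.label, r.predicate, r.obj.label) for r in relationships]

person = DetectedObject("person", (10, 20, 110, 220))
horse = DetectedObject("horse", (60, 80, 300, 260))
scene_graph = [Relationship(person, "riding", horse)]
print(to_triples(scene_graph))  # [('person', 'riding', 'horse')]
```

Keeping relationships as triples over detected objects (rather than free-form captions) is what makes the representation structured: it can be traversed, queried, and reasoned over compositionally.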
Related papers
- InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding [12.082379948480257]
This paper proposes InsightSee, a multi-agent framework to enhance vision-language models' capabilities in handling complex visual understanding scenarios.
The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation.
The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.
arXiv Detail & Related papers (2024-05-31T13:56:55Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that can see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models can understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Compositional Scene Representation Learning via Reconstruction: A Survey [48.33349317481124]
Compositional scene representation learning is a task that enables machines to perceive scenes as compositions of objects.
Deep neural networks have been proven to be advantageous in representation learning.
Learning via reconstruction is advantageous because it may utilize massive unlabeled data and avoid costly and laborious data annotation.
arXiv Detail & Related papers (2022-02-15T02:14:05Z)
- Constellation: Learning relational abstractions over objects for compositional imagination [64.99658940906917]
We introduce Constellation, a network that learns relational abstractions of static visual scenes.
This work is a first step toward explicitly representing visual relationships and using them for complex cognitive procedures.
arXiv Detail & Related papers (2021-07-23T11:59:40Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method more effectively models relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
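The message-passing idea in the last entry, propagating information along relations between scene entities, can be sketched minimally as below. The toy graph, feature vectors, and mean-aggregation update rule are illustrative assumptions, not the paper's actual model.

```python
# A toy sketch of one message-passing round over a relational graph of
# scene entities. Each node updates its feature vector by averaging the
# features of its incoming neighbors together with its own.
# Graph, features, and update rule are hypothetical examples.

def message_passing_step(features, edges):
    """One round of message passing.

    features: dict mapping node name -> list[float] feature vector
    edges: list of (src, dst) pairs; messages flow src -> dst
    """
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])
    updated = {}
    for node, feat in features.items():
        msgs = incoming[node] + [feat]      # include a self-message
        dim = len(feat)
        updated[node] = [sum(m[i] for m in msgs) / len(msgs) for i in range(dim)]
    return updated

feats = {"person": [1.0, 0.0], "horse": [0.0, 1.0]}
edges = [("person", "horse")]               # e.g. person -riding-> horse
print(message_passing_step(feats, edges))
# {'person': [1.0, 0.0], 'horse': [0.5, 0.5]}
```

Real models replace the mean with learned message and update functions and run several rounds, but the structure-aware flow of information along relations is the same.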
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.