Less is More: Toward Zero-Shot Local Scene Graph Generation via
Foundation Models
- URL: http://arxiv.org/abs/2310.01356v1
- Date: Mon, 2 Oct 2023 17:19:04 GMT
- Title: Less is More: Toward Zero-Shot Local Scene Graph Generation via
Foundation Models
- Authors: Shu Zhao, Huijuan Xu
- Abstract summary: We present a new task called Local Scene Graph Generation.
It aims to abstract pertinent structural information with partial objects and their relationships in an image.
We introduce zEro-shot Local scEne GrAph geNeraTion (ELEGANT), a framework harnessing foundation models renowned for their powerful perception and commonsense reasoning.
- Score: 16.08214739525615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans inherently recognize objects via selective visual perception,
transform specific regions from the visual field into structured symbolic
knowledge, and reason about the relationships among regions based on the allocation
of limited attention resources in line with humans' goals. While it is
intuitive for humans, contemporary perception systems falter in extracting
structural information due to the intricate cognitive abilities and commonsense
knowledge required. To fill this gap, we present a new task called Local Scene
Graph Generation. Distinct from the conventional scene graph generation task,
which encompasses generating all objects and relationships in an image, our
proposed task aims to abstract pertinent structural information with partial
objects and their relationships for boosting downstream tasks that demand
advanced comprehension and reasoning capabilities. Correspondingly, we
introduce zEro-shot Local scEne GrAph geNeraTion (ELEGANT), a framework
harnessing foundation models renowned for their powerful perception and
commonsense reasoning, where collaboration and information communication among
foundation models yield superior outcomes and realize zero-shot local scene
graph generation without requiring labeled supervision. Furthermore, we propose
a novel open-ended evaluation metric, Entity-level CLIPScorE (ECLIPSE),
surpassing previous closed-set evaluation metrics by transcending their limited
label space, offering a broader assessment. Experimental results show that our
approach markedly outperforms baselines in the open-ended evaluation setting,
and it also achieves a significant performance boost of up to 24.58% over prior
methods in the closed-set setting, demonstrating the effectiveness and powerful
reasoning ability of our proposed framework.
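As a rough illustration of the two ideas in the abstract (a local scene graph as a subset of triplets, and an open-ended entity-level score), the following minimal sketch may help. It is not the authors' ELEGANT or ECLIPSE implementation: the names `Triplet`, `local_subgraph`, `toy_embed`, and `entity_level_score` are hypothetical, and the character-trigram embedding is only a stand-in for a learned encoder such as CLIP.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Triplet:
    """One (subject, predicate, object) edge of a scene graph."""
    subject: str
    predicate: str
    obj: str


def local_subgraph(triplets, focus):
    """Keep only the triplets that touch at least one focus entity."""
    return [t for t in triplets if t.subject in focus or t.obj in focus]


def toy_embed(text):
    """ASSUMPTION: bag of character trigrams, standing in for a learned text encoder."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def entity_level_score(predicted, reference):
    """Average, over reference entities, of the best similarity to any prediction.

    Free-form entity names are compared in embedding space, so the score is
    not restricted to a fixed label set.
    """
    if not predicted or not reference:
        return 0.0
    pred_vecs = [toy_embed(p) for p in predicted]
    best = [max(cosine(toy_embed(r), v) for v in pred_vecs) for r in reference]
    return sum(best) / len(best)


full_graph = [
    Triplet("man", "riding", "horse"),
    Triplet("horse", "on", "grass"),
    Triplet("tree", "behind", "fence"),
]
# A local scene graph centered on the entity "man".
local = local_subgraph(full_graph, focus={"man"})
```

Here `local` keeps only the `man riding horse` triplet, while the conventional task would return all three. Swapping `toy_embed` for a real image-text encoder would be the natural next step, but how the paper combines such embeddings is not specified in this listing.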
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Augmented Commonsense Knowledge for Remote Object Grounding [67.30864498454805]
We propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a spatio-temporal knowledge graph for improving agent navigation.
ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment.
We add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction.
arXiv Detail & Related papers (2024-06-03T12:12:33Z)
- 3D WholeBody Pose Estimation based on Semantic Graph Attention Network and Distance Information [2.457872341625575]
A novel Semantic Graph Attention Network can benefit from the ability of self-attention to capture global context.
A Body Part Decoder assists in extracting and refining the information related to specific segments of the body.
A Geometry Loss makes a critical constraint on the structural skeleton of the body, ensuring that the model's predictions adhere to the natural limits of human posture.
arXiv Detail & Related papers (2024-06-03T10:59:00Z)
- Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification [0.8232137862012223]
This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information.
To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors.
Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements.
arXiv Detail & Related papers (2024-03-18T18:08:44Z)
- Optimization Efficient Open-World Visual Region Recognition [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.
Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual
Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
- Learning Attention-based Representations from Multiple Patterns for
Relation Prediction in Knowledge Graphs [2.4028383570062606]
AEMP is a novel model for learning contextualized representations by acquiring entities' context information.
AEMP either outperforms or competes with state-of-the-art relation prediction methods.
arXiv Detail & Related papers (2022-06-07T10:53:35Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation
Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.