Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal
Structured Representations
- URL: http://arxiv.org/abs/2305.06152v3
- Date: Wed, 13 Dec 2023 04:21:23 GMT
- Title: Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal
Structured Representations
- Authors: Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang,
Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang
- Abstract summary: We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhance Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations.
- Score: 70.41385310930846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language pre-training has achieved significant performance
in multi-modal understanding and generation tasks. However, existing methods
often perform poorly on image-text matching tasks that require structured
representations, i.e., representations of objects, attributes, and relations.
For example, the models cannot distinguish between "An astronaut rides a
horse" and "A horse rides an astronaut". This
is because they fail to fully leverage structured knowledge when learning
representations in multi-modal scenarios. In this paper, we present an
end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge
(SGK) to enhance multi-modal structured representations. Firstly, we use scene
graphs to guide the construction of semantic negative examples, which results
in an increased emphasis on learning structured representations. Moreover, a
Knowledge-Enhance Encoder (KEE) is proposed to leverage SGK as input to further
enhance structured representations. To verify the effectiveness of the proposed
framework, we pre-train our model with the aforementioned approaches and
conduct experiments on downstream tasks. Experimental results demonstrate that
Structure-CLIP achieves state-of-the-art (SOTA) performance on VG-Attribution
and VG-Relation datasets, outperforming the previous multi-modal SOTA model by
12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that
Structure-CLIP significantly enhances the structured representations while
maintaining the ability of general representations. Our code is available at
https://github.com/zjukg/Structure-CLIP.
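
As a concrete illustration of the scene-graph-guided negative construction described in the abstract, below is a minimal, hypothetical Python sketch (not the authors' code; the triple representation and caption realization are simplifying assumptions) of how a relation triple can be flipped into a structure-level hard negative such as "An astronaut rides a horse" vs. "A horse rides an astronaut":

```python
# Hypothetical sketch of scene-graph-guided hard-negative construction.
# A real pipeline would extract (subject, relation, object) triples with a
# scene graph parser; here the triples are given directly.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Triple:
    subject: str   # e.g. "an astronaut"
    relation: str  # e.g. "rides"
    obj: str       # e.g. "a horse"

    def to_caption(self) -> str:
        # Naive caption realization from a single triple.
        return f"{self.subject} {self.relation} {self.obj}".capitalize()


def semantic_negative(t: Triple) -> Triple:
    """Swap subject and object: same words, different structure."""
    return Triple(subject=t.obj, relation=t.relation, obj=t.subject)


def build_pairs(triples: List[Triple]) -> List[Tuple[str, str]]:
    """(positive caption, hard-negative caption) pairs for contrastive training."""
    return [(t.to_caption(), semantic_negative(t).to_caption()) for t in triples]


if __name__ == "__main__":
    for pos, neg in build_pairs([Triple("an astronaut", "rides", "a horse")]):
        print("positive:", pos)  # An astronaut rides a horse
        print("negative:", neg)  # A horse rides an astronaut
```

The Knowledge-Enhance Encoder is described only at a high level in the abstract; the sketch below shows one plausible way to encode scene graph triples and fuse the result with a CLIP text embedding. The class name, dimensions, mean pooling, and concatenation-based fusion are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical knowledge encoder: embed tokenized triples, encode them with a
# small Transformer, and fuse the pooled output with the CLIP text embedding.
import torch
import torch.nn as nn


class KnowledgeEncoderSketch(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512, n_layers: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, triple_token_ids: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # triple_token_ids: (batch, seq) token ids of flattened triples
        # text_emb: (batch, dim) CLIP text embedding of the caption
        h = self.encoder(self.token_emb(triple_token_ids))   # (batch, seq, dim)
        pooled = h.mean(dim=1)                                # (batch, dim)
        return self.fuse(torch.cat([text_emb, pooled], dim=-1))  # (batch, dim)
```

The fused embedding would then stand in for the plain text embedding in the image-text contrastive loss, alongside the hard negatives constructed above.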
Related papers
- HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction [24.46493675079128]
OCR-dependent methods rely on offline OCR engines, while OCR-free methods might produce outputs that lack interpretability or contain hallucinated content.
We propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task.
Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities.
arXiv Detail & Related papers (2024-11-02T05:00:13Z)
- Learning to Model Graph Structural Information on MLPs via Graph Structure Self-Contrasting [50.181824673039436]
We propose a Graph Structure Self-Contrasting (GSSC) framework that learns graph structural information without message passing.
The proposed framework is based purely on Multi-Layer Perceptrons (MLPs), where the structural information is only implicitly incorporated as prior knowledge.
It first applies structural sparsification to remove potentially uninformative or noisy edges in the neighborhood, and then performs structural self-contrasting in the sparsified neighborhood to learn robust node representations.
arXiv Detail & Related papers (2024-09-09T12:56:02Z)
- Emergent Visual-Semantic Hierarchies in Image-Text Representations [13.300199242824934]
We study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies.
We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding.
arXiv Detail & Related papers (2024-07-11T14:09:42Z)
- Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models [43.56153167864033]
We propose a novel approach to harnessing structured knowledge in large language models (LLMs).
We introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning.
In addition, by incorporating high-level and global-level prompts, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships.
arXiv Detail & Related papers (2023-12-11T12:14:06Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors [0.3913403111891026]
The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known.
We propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning.
arXiv Detail & Related papers (2023-11-07T11:17:55Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)