ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
- URL: http://arxiv.org/abs/2311.13258v1
- Date: Wed, 22 Nov 2023 09:23:34 GMT
- Title: ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
- Authors: Yangyi Chen, Xingyao Wang, Manling Li, Derek Hoiem, Heng Ji
- Abstract summary: State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction.
We present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction.
- Score: 82.88378582161717
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: State-of-the-art vision-language models (VLMs) still have limited performance
in structural knowledge extraction, such as relations between objects. In this
work, we present ViStruct, a training framework to learn VLMs for effective
visual structural knowledge extraction. Two novel designs are incorporated.
First, we propose to leverage the inherent structure of programming language to
depict visual structural information. This approach enables explicit and
consistent representation of visual structural information of multiple
granularities, such as concepts, relations, and events, in a well-organized
structured format. Second, we introduce curriculum-based learning for VLMs to
progressively comprehend visual structures, from fundamental visual concepts to
intricate event structures. Our intuition is that lower-level knowledge may
contribute to complex visual structure understanding. Furthermore, we compile
and release a collection of datasets tailored for visual structural knowledge
extraction. We adopt a weakly-supervised approach to directly generate visual
event structures from captions for ViStruct training, capitalizing on abundant
image-caption pairs from the web. In experiments, we evaluate ViStruct on
visual structure prediction tasks, demonstrating its effectiveness in improving
the understanding of visual structures. The code is publicly available at
https://github.com/Yangyi-Chen/vi-struct.
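The abstract describes depicting concepts, relations, and events in a code-like structured format, but does not give the exact schema. Below is a minimal Python sketch of what such a code-vision representation could look like; the class names, role labels, and example instance are illustrative assumptions rather than the paper's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Concept:
    """A visual concept grounded in the image (e.g., an object category)."""
    name: str  # e.g., "person", "horse"


@dataclass
class Relation:
    """A pairwise relation between two grounded concepts."""
    subject: Concept
    predicate: str  # e.g., "riding"
    obj: Concept


@dataclass
class Event:
    """An event structure: a trigger plus role-labeled arguments."""
    trigger: str  # e.g., "ride"
    arguments: List[Tuple[str, Concept]] = field(default_factory=list)


# Hypothetical structured output a VLM could be trained to emit for an image
# captioned "a person riding a horse on the beach" (names and roles are
# illustrative, not taken from the paper).
person, horse = Concept("person"), Concept("horse")
structure = {
    "concepts": [person, horse, Concept("beach")],
    "relations": [Relation(person, "riding", horse)],
    "events": [Event("ride", [("Agent", person), ("Mount", horse)])],
}
print(structure)
```

In this spirit, the weakly-supervised setup described above would map web captions to such event instances (for example, by extracting predicate-argument structure from the caption text) to produce training targets, and the curriculum would order training from concept-level to relation- and event-level structures.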
Related papers
- Learning Correlation Structures for Vision Transformers [93.22434535223587]
We introduce a new attention mechanism, dubbed structural self-attention (StructSA).
We generate attention maps by recognizing space-time structures of key-query correlations via convolution.
This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations.
arXiv Detail & Related papers (2024-04-05T07:13:28Z)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding [100.17063271791528]
We propose Unified Structure Learning to boost the performance of MLLMs.
Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks.
arXiv Detail & Related papers (2024-03-19T16:48:40Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present Structure-CLIP, an end-to-end framework to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- Task-specific Scene Structure Representations [13.775485887433815]
We propose a single general neural network architecture for extracting task-specific structure guidance for scenes.
Our main contribution is to show that such a simple network can achieve state-of-the-art results for several low-level vision applications.
arXiv Detail & Related papers (2023-01-02T08:25:47Z)
- Learning Structured Representations of Visual Scenes [1.6244541005112747]
We study how machines can describe the content of individual images or videos with visual relationships as structured representations.
Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both the static-image and video settings.
arXiv Detail & Related papers (2022-07-09T05:40:08Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships [17.930724926012264]
We introduce a new task that targets inducing a joint vision-language structure in an unsupervised manner.
Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly.
We propose an automatic alignment procedure to produce coarse structures followed by human refinement to produce high-quality ones.
arXiv Detail & Related papers (2022-03-27T09:51:34Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Learning to Incorporate Structure Knowledge for Image Inpainting [20.93448933499842]
This paper develops a multi-task learning framework that incorporates image structure knowledge to assist image inpainting.
The primary idea is to train a shared generator to simultaneously complete the corrupted image and corresponding structures.
In the meantime, we also introduce a structure embedding scheme to explicitly embed the learned structure features into the inpainting process.
arXiv Detail & Related papers (2020-02-11T02:22:38Z)