Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene
Graphs with Language Structures via Dependency Relationships
- URL: http://arxiv.org/abs/2203.14260v1
- Date: Sun, 27 Mar 2022 09:51:34 GMT
- Title: Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene
Graphs with Language Structures via Dependency Relationships
- Authors: Chao Lou, Wenjuan Han, Yuhuan Lin, Zilong Zheng
- Abstract summary: We introduce a new task that aims to induce a joint vision-language structure in an unsupervised manner.
Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly.
We propose an automatic alignment procedure that produces coarse structures, followed by human refinement to yield high-quality ones.
- Score: 17.930724926012264
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding realistic visual scene images together with language
descriptions is a fundamental task towards generic visual understanding.
Previous works have shown compelling results by building hierarchical
structures for visual scenes (e.g., scene graphs) and natural language
(e.g., dependency trees) individually. However, how to construct a
joint vision-language (VL) structure has barely been investigated. We
introduce a new, more challenging but worthwhile task: inducing such a joint
VL structure in an unsupervised manner. Our goal is to bridge the
visual scene graphs and linguistic dependency trees seamlessly. Due to the lack
of VL structural data, we start by building a new dataset, VLParse. Rather than
relying on labor-intensive labeling from scratch, we propose an automatic alignment
procedure that produces coarse structures, followed by human refinement to yield
high-quality ones. Moreover, we benchmark our dataset by proposing a
contrastive learning (CL)-based framework VLGAE, short for Vision-Language
Graph Autoencoder. Our model obtains superior performance on two derived tasks,
i.e., language grammar induction and VL phrase grounding. Ablations show the
effectiveness of both visual cues and dependency relationships on fine-grained
VL structure construction.
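The abstract describes VLGAE only at a high level, as a contrastive learning (CL)-based graph autoencoder that aligns scene-graph nodes with dependency-tree tokens. As a rough, hypothetical sketch only (none of the names, shapes, or the symmetric InfoNCE choice below come from the paper), the alignment objective at the heart of such a framework might look like this:

```python
# Illustrative sketch, NOT the authors' VLGAE implementation: a symmetric
# InfoNCE-style objective that pulls each scene-graph node embedding toward
# the dependency-tree token it is aligned with, and pushes it away from the
# other tokens in the batch. All names and shapes are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_nodes: torch.Tensor,
                               text_nodes: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """visual_nodes: (N, D) scene-graph node embeddings.
    text_nodes: (N, D) embeddings of the tokens aligned with them
    (row i of each tensor forms a positive pair)."""
    v = F.normalize(visual_nodes, dim=-1)
    t = F.normalize(text_nodes, dim=-1)
    logits = v @ t.T / temperature        # (N, N) cosine similarities
    targets = torch.arange(v.size(0))     # positives lie on the diagonal
    # Symmetric: score both visual->text and text->visual directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Toy usage: 4 aligned (node, token) pairs with 256-dim embeddings.
loss = contrastive_alignment_loss(torch.randn(4, 256), torch.randn(4, 256))
```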
Related papers
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks (a minimal example of such a graph appears after this list).
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- 3VL: using Trees to teach Vision & Language models compositional concepts [45.718319397947056]
We introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique.
We show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors.
We also exhibit how DiRe, which performs a differential relevancy comparison between VLM maps, enables us to generate compelling visualizations of a model's success or failure.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
- ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation [82.88378582161717]
State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction.
We present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction.
arXiv Detail & Related papers (2023-11-22T09:23:34Z)
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which guides the LLM to explicitly compose visual entities and the relationships among them within the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
- Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
- Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
- Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that, by taking advantage of these relationships, we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph [38.97228345655337]
ERNIE-ViL builds detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language.
It constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction.
ERNIE-ViL achieves state-of-the-art performance on all these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.
arXiv Detail & Related papers (2020-06-30T16:03:12Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
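Several entries above (the open-vocabulary SGG paper and ERNIE-ViL in particular) revolve around scene graphs as an intermediate representation. For readers unfamiliar with the structure, here is a minimal, generic example of a scene graph as typed triples; the schema is a common convention, not any single paper's format:

```python
# A scene graph as a list of (subject, predicate, object) triples: the
# "intermediate graph representation" the SGG entries above refer to.
# Generic convention for illustration; not any single paper's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    subject: str    # object category or region label, e.g. "man"
    predicate: str  # visual relation, e.g. "riding"
    obj: str        # the related object, e.g. "horse"

# "A man riding a horse on a beach" as a tiny scene graph:
scene_graph = [
    Relation("man", "riding", "horse"),
    Relation("horse", "on", "beach"),
]

# Open-vocabulary SGG relaxes the subject/predicate/object vocabularies so
# that relations over unseen categories can still be produced, e.g. by
# decoding the triples as free-form text with a sequence-generation model.
```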
This list is automatically generated from the titles and abstracts of the papers on this site.