VLGrammar: Grounded Grammar Induction of Vision and Language
- URL: http://arxiv.org/abs/2103.12975v1
- Date: Wed, 24 Mar 2021 04:05:08 GMT
- Title: VLGrammar: Grounded Grammar Induction of Vision and Language
- Authors: Yining Hong, Qing Li, Song-Chun Zhu, Siyuan Huang
- Abstract summary: We study grounded grammar induction of vision and language in a joint learning framework.
We present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously.
- Score: 86.88273769411428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cognitive grammar suggests that the acquisition of language grammar is
grounded within visual structures. While grammar is an essential representation
of natural language, it also exists ubiquitously in vision to represent the
hierarchical part-whole structure. In this work, we study grounded grammar
induction of vision and language in a joint learning framework. Specifically,
we present VLGrammar, a method that uses compound probabilistic context-free
grammars (compound PCFGs) to induce the language grammar and the image grammar
simultaneously. We propose a novel contrastive learning framework to guide the
joint learning of both modules. To provide a benchmark for the grounded grammar
induction task, we collect a large-scale dataset, PartIt, which
contains human-written sentences that describe part-level semantics for 3D
objects. Experiments on the PartIt dataset show that VLGrammar
outperforms all baselines in image grammar induction and language grammar
induction. The learned VLGrammar naturally benefits related downstream tasks.
Specifically, it improves the image unsupervised clustering accuracy by 30%,
and performs well in image retrieval and text retrieval. Notably, the induced
grammar shows superior generalizability by easily generalizing to unseen
categories.
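As a rough illustration of the grammar formalism named in the abstract, the sketch below computes the probability of a token sequence under a tiny hand-written PCFG in Chomsky normal form, using the inside algorithm. The grammar, its probabilities, and the function name `inside_probability` are illustrative assumptions and are not taken from VLGrammar. A compound PCFG additionally conditions the rule probabilities on a per-sentence latent vector, which is omitted here.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
# Binary rules: (parent, left child, right child) -> probability.
BINARY_RULES = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "DT", "NN"): 1.0,
    ("VP", "VB", "NP"): 1.0,
}
# Lexical rules: (preterminal, word) -> probability.
LEXICAL_RULES = {
    ("DT", "the"): 1.0,
    ("NN", "chair"): 0.5,
    ("NN", "leg"): 0.5,
    ("VB", "has"): 1.0,
}


def inside_probability(tokens, root="S"):
    """Return P(tokens | grammar), summed over all parse trees rooted at `root`."""
    n = len(tokens)
    # chart[(i, j)][A] = probability that nonterminal A derives tokens[i:j]
    chart = defaultdict(lambda: defaultdict(float))

    # Width-1 spans: apply lexical rules.
    for i, token in enumerate(tokens):
        for (lhs, word), prob in LEXICAL_RULES.items():
            if word == token:
                chart[(i, i + 1)][lhs] += prob

    # Wider spans: combine two adjacent sub-spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (lhs, left, right), prob in BINARY_RULES.items():
                    left_prob = chart[(i, k)].get(left, 0.0)
                    right_prob = chart[(k, j)].get(right, 0.0)
                    if left_prob and right_prob:
                        chart[(i, j)][lhs] += prob * left_prob * right_prob

    return chart[(0, n)].get(root, 0.0)


if __name__ == "__main__":
    print(inside_probability("the chair has the leg".split()))
```

Grammar induction methods of this family learn the rule probabilities from data instead of fixing them by hand; the chart computation itself stays the same.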
Related papers
- Detecting and explaining (in)equivalence of context-free grammars [0.6282171844772422]
We propose a scalable framework for deciding, proving, and explaining (in)equivalence of context-free grammars.
We present an implementation of the framework and evaluate it on large data sets collected within educational support systems.
arXiv Detail & Related papers (2024-07-25T17:36:18Z)
- Learning Language Structures through Grounding [8.437466837766895]
We consider a family of machine learning tasks that aim to learn language structures through grounding.
In Part I, we consider learning syntactic parses through visual grounding.
In Part II, we propose two execution-aware methods to map sentences into corresponding semantic structures.
In Part III, we propose methods that learn language structures from annotations in other languages.
arXiv Detail & Related papers (2024-06-14T02:21:53Z)
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships [17.930724926012264]
We introduce a new task that aims to induce a joint vision-language structure in an unsupervised manner.
Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly.
We propose an automatic alignment procedure to produce coarse structures followed by human refinement to produce high-quality ones.
arXiv Detail & Related papers (2022-03-27T09:51:34Z)
- Learning grammar with a divide-and-concur neural network [4.111899441919164]
We implement a divide-and-concur iterative projection approach to context-free grammar inference.
Our method requires a relatively small number of discrete parameters, making the inferred grammar directly interpretable.
arXiv Detail & Related papers (2022-01-18T22:42:43Z)
- Dependency Induction Through the Lens of Visual Perception [81.91502968815746]
We propose an unsupervised grammar induction model that leverages word concreteness and a structural vision-based heuristic to jointly learn constituency-structure and dependency-structure grammars.
Our experiments show that the proposed extension outperforms the current state-of-the-art visually grounded models in constituency parsing even with a smaller grammar size.
arXiv Detail & Related papers (2021-09-20T18:40:37Z)
- Video-aided Unsupervised Grammar Induction [108.53765268059425]
We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video.
Video provides even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases.
We propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities.
arXiv Detail & Related papers (2021-04-09T14:01:36Z)
- Visually Grounded Compound PCFGs [65.04669567781634]
Exploiting visual groundings for language understanding has recently been drawing much attention.
We study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings.
arXiv Detail & Related papers (2020-09-25T19:07:00Z)
- Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion [53.31911669146451]
Human-curated knowledge graphs provide critical supportive information to various natural language processing tasks.
These graphs are usually incomplete, which motivates completing them automatically.
Graph embedding approaches, e.g., TransE, learn structured knowledge by representing graph elements as dense embeddings.
Textual encoding approaches, e.g., KG-BERT, resort to the text of graph triples and triple-level contextualized representations.
arXiv Detail & Related papers (2020-04-30T13:50:34Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.