Weakly Supervised Visual Semantic Parsing
- URL: http://arxiv.org/abs/2001.02359v2
- Date: Tue, 31 Mar 2020 18:54:06 GMT
- Title: Weakly Supervised Visual Semantic Parsing
- Authors: Alireza Zareian, Svebor Karaman, Shih-Fu Chang
- Abstract summary: Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images.
Existing SGG methods require millions of manually annotated bounding boxes for training.
We propose Visual Semantic Parsing, VSPNet, and graph-based weakly supervised learning framework.
- Score: 49.69377653925448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Graph Generation (SGG) aims to extract entities, predicates and their
semantic structure from images, enabling deep understanding of visual content,
with many applications such as visual reasoning and image retrieval.
Nevertheless, existing SGG methods require millions of manually annotated
bounding boxes for training, and are computationally inefficient, as they
exhaustively process all pairs of object proposals to detect predicates. In
this paper, we address those two limitations by first proposing a generalized
formulation of SGG, namely Visual Semantic Parsing, which disentangles entity
and predicate recognition, and enables sub-quadratic performance. Then we
propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic,
attention-based, bipartite message passing framework that jointly infers graph
nodes and edges through an iterative process. Additionally, we propose the
first graph-based weakly supervised learning framework, based on a novel graph
alignment algorithm, which enables training without bounding box annotations.
Through extensive experiments, we show that VSPNet outperforms weakly
supervised baselines significantly and approaches fully supervised performance,
while being several times faster. We publicly release the source code of our
method.
Related papers
- SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning [131.04781590452308]
We present SimTeG, a frustratingly Simple approach for Textual Graph learning.
We first perform supervised parameter-efficient fine-tuning (PEFT) on a pre-trained LM on the downstream task.
We then generate node embeddings using the last hidden states of finetuned LM.
arXiv Detail & Related papers (2023-08-03T07:00:04Z) - CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects-of-interest.
arXiv Detail & Related papers (2023-05-12T08:19:39Z) - Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN)
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z) - Towards Open-vocabulary Scene Graph Generation with Prompt-based
Finetuning [84.39787427288525]
Scene graph generation (SGG) is a fundamental task aimed at detecting visual relations between objects in an image.
We introduce open-vocabulary scene graph generation, a novel, realistic and challenging setting in which a model is trained on a set of base object classes.
Our method can support inference over completely unseen object classes, which existing methods are incapable of handling.
arXiv Detail & Related papers (2022-08-17T09:05:38Z) - Learning the Implicit Semantic Representation on Graph-Structured Data [57.670106959061634]
Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole.
We propose a Semantic Graph Convolutional Networks (SGCN) that explores the implicit semantics by learning latent semantic-paths in graphs.
arXiv Detail & Related papers (2021-01-16T16:18:43Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.