Visually Grounded Concept Composition
- URL: http://arxiv.org/abs/2109.14115v1
- Date: Wed, 29 Sep 2021 00:38:58 GMT
- Title: Visually Grounded Concept Composition
- Authors: Bowen Zhang, Hexiang Hu, Linlu Qiu, Peter Shaw, Fei Sha
- Abstract summary: We learn the grounding of both primitive and all composed concepts by aligning them to images.
We show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy.
- Score: 31.981204314287282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate ways to compose complex concepts in texts from primitive ones
while grounding them in images. We propose Concept and Relation Graph (CRG),
which builds on top of constituency analysis and consists of recursively
combined concepts with predicate functions. Meanwhile, we propose a concept
composition neural network called Composer to leverage the CRG for visually
grounded concept learning. Specifically, we learn the grounding of both
primitive and all composed concepts by aligning them to images and show that
learning to compose leads to more robust grounding results, measured in
text-to-image matching accuracy. Notably, our model can represent grounded concepts
formed at both the finer-grained sentence level and the coarser-grained
intermediate level (or word level). Composer leads to pronounced improvement in
matching accuracy when the evaluation data has significant compound divergence
from the training data.
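To make the described pipeline concrete, here is a minimal, hypothetical sketch of CRG-style recursive concept composition with contrastive text-to-image matching. It is not the authors' implementation: the class names, the predicate-embedding scheme, the mean-pooled primitive encoding, and the folding order are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptNode:
    """A node in a toy concept-and-relation graph: either a primitive word
    span or a composition of child concepts under a predicate."""
    def __init__(self, token_ids=None, children=None, predicate=0):
        self.token_ids = token_ids or []   # non-empty for primitive concepts
        self.children = children or []     # non-empty for composed concepts
        self.predicate = predicate         # id of the combining predicate

class Composer(nn.Module):
    """Recursively composes concept embeddings and grounds them in images."""
    def __init__(self, vocab_size, num_predicates, dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.pred_emb = nn.Embedding(num_predicates, dim)
        # predicate function: merge a partial composition with the next child
        self.compose = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def encode_concept(self, node):
        if not node.children:
            # primitive concept: mean of its word embeddings
            ids = torch.tensor(node.token_ids)
            return self.word_emb(ids).mean(dim=0)
        # composed concept: fold child concepts with the predicate function
        child_vecs = [self.encode_concept(c) for c in node.children]
        pred = self.pred_emb(torch.tensor(node.predicate))
        out = child_vecs[0]
        for vec in child_vecs[1:]:
            out = self.compose(torch.cat([out + pred, vec], dim=-1))
        return out

def matching_loss(concept_vecs, image_vecs, temperature=0.07):
    """Contrastive text-to-image matching over a batch of (concept, image) pairs."""
    t = F.normalize(concept_vecs, dim=-1)
    v = F.normalize(image_vecs, dim=-1)
    logits = t @ v.t() / temperature
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

# Usage: ground a primitive and a composed concept against image vectors.
model = Composer(vocab_size=1000, num_predicates=10)
dog = ConceptNode(token_ids=[42])
black_dog = ConceptNode(children=[ConceptNode(token_ids=[7]), dog], predicate=3)
concepts = torch.stack([model.encode_concept(dog), model.encode_concept(black_dog)])
images = torch.randn(2, 256)
loss = matching_loss(concepts, images)
```

In this toy version, every node of the graph, primitive or composed, is mapped into the same embedding space and can be aligned to images with the same matching loss, mirroring the idea that grounding is learned at all levels of composition.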
Related papers
- CusConcept: Customized Visual Concept Decomposition with Diffusion Models [13.95568624067449]
We propose a two-stage framework, CusConcept, to extract customized visual concept embedding vectors.
In the first stage, CusConcept employs a vocabulary-guided concept decomposition mechanism.
In the second stage, joint concept refinement is performed to enhance the fidelity and quality of generated images.
arXiv Detail & Related papers (2024-10-01T04:41:44Z)
- Towards Compositionality in Concept Learning [20.960438848942445]
We show that existing unsupervised concept extraction methods find concepts which are not compositional.
We propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties.
CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks.
arXiv Detail & Related papers (2024-06-26T17:59:30Z)
- Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks [24.45212348373868]
This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks.
Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training.
This work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations.
arXiv Detail & Related papers (2024-01-09T16:16:16Z)
- Improving Image Captioning via Predicting Structured Concepts [46.88858655641866]
We propose a structured concept predictor to predict concepts and their structures, then we integrate them into captioning.
We design weighted graph convolutional networks (W-GCN) to depict concept relations driven by word dependencies; a toy sketch of such a layer follows this entry.
Our approach captures potential relations among concepts and discriminatively learns different concepts, thereby effectively facilitating image captioning with the inherited information.
arXiv Detail & Related papers (2023-11-14T15:01:58Z)
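As a rough illustration of the W-GCN idea referenced above, here is a hypothetical weighted graph convolution over concept nodes. The edge weights are assumed to come from a word-dependency parse, and the normalization and layer shape are guesses rather than the paper's design.

```python
import torch
import torch.nn as nn

class WeightedGraphConv(nn.Module):
    """One graph-convolution layer whose message passing is scaled by edge weights."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, edge_weights):
        # node_feats: (N, in_dim); edge_weights: (N, N), nonzero where concepts relate
        adj = edge_weights + torch.eye(edge_weights.size(0))  # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        neighborhood = (adj / deg) @ node_feats               # weighted mean over neighbors
        return torch.relu(self.linear(neighborhood))

# Usage: 5 concept nodes with 128-d features and a soft relation matrix.
feats = torch.randn(5, 128)
weights = torch.rand(5, 5)
layer = WeightedGraphConv(128, 128)
out = layer(feats, weights)   # (5, 128)
```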
- Does Visual Pretraining Help End-to-End Reasoning? [81.4707017038019]
We investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks.
We propose a simple and general self-supervised framework which "compresses" each video frame into a small set of tokens.
We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning.
arXiv Detail & Related papers (2023-07-17T14:08:38Z)
- ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models [79.10890337599166]
We introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts.
We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions.
Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome.
arXiv Detail & Related papers (2023-06-07T18:00:38Z)
- ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation [17.019848796027485]
Self-supervised visual pre-training models have shown great promise in representing pixel-level semantic relationships.
In this work, we investigate the pixel-level semantic aggregation in self-supervised pre-trained image encoders and encode concepts as learnable prototypes.
We propose the Adaptive Concept Generator (ACG), which adaptively maps these prototypes to informative concepts for each image (a toy sketch follows this entry).
arXiv Detail & Related papers (2022-10-12T06:16:34Z)
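The ACG described in the ACSeg entry above can be pictured with this hypothetical sketch: learnable prototypes attend over pixel features to become image-specific concepts, and each pixel is then assigned to its most similar concept. The attention mechanism, prototype count, and assignment rule here are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConceptGenerator(nn.Module):
    """Maps shared learnable prototypes to per-image concepts via attention."""
    def __init__(self, num_prototypes=16, dim=256):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, pixel_feats):
        # pixel_feats: (B, H*W, dim) features from a self-supervised backbone
        B = pixel_feats.size(0)
        queries = self.prototypes.unsqueeze(0).expand(B, -1, -1)
        # prototypes become image-specific concepts by attending over pixels
        concepts, _ = self.attn(queries, pixel_feats, pixel_feats)
        # assign each pixel to its most similar concept (a pseudo segmentation)
        sim = F.normalize(pixel_feats, dim=-1) @ F.normalize(concepts, dim=-1).transpose(1, 2)
        assignment = sim.argmax(dim=-1)   # (B, H*W) concept index per pixel
        return concepts, assignment

# Usage: a 16x16 feature map with 256-d features for a batch of 2 images.
feats = torch.randn(2, 16 * 16, 256)
acg = AdaptiveConceptGenerator()
concepts, assignment = acg(feats)
```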
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate the open-domain learning.
The proposed framework demonstrates strong zero-shot detection performance, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Concept Learners for Few-Shot Learning [76.08585517480807]
We propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions.
We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation.
arXiv Detail & Related papers (2020-07-14T22:04:17Z)
- Gradient-Induced Co-Saliency Detection [81.54194063218216]
Co-saliency detection (Co-SOD) aims to segment the common salient foreground in a group of relevant images.
In this paper, inspired by human behavior, we propose a gradient-induced co-saliency detection method.
arXiv Detail & Related papers (2020-04-28T08:40:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.