Fine-Grained Visual Entailment
- URL: http://arxiv.org/abs/2203.15704v1
- Date: Tue, 29 Mar 2022 16:09:38 GMT
- Title: Fine-Grained Visual Entailment
- Authors: Christopher Thomas and Yipeng Zhang and Shih-Fu Chang
- Abstract summary: We propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity.
We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task.
- Score: 51.66881737644983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual entailment is a recently proposed multimodal reasoning task where the
goal is to predict the logical relationship of a piece of text to an image. In
this paper, we propose an extension of this task, where the goal is to predict
the logical relationship of fine-grained knowledge elements within a piece of
text to an image. Unlike prior work, our method is inherently explainable and
makes logical predictions at different levels of granularity. Because we lack
fine-grained labels to train our method, we propose a novel multi-instance
learning approach which learns a fine-grained labeling using only sample-level
supervision. We also impose novel semantic structural constraints which ensure
that fine-grained predictions are internally semantically consistent. We
evaluate our method on a new dataset of manually annotated knowledge elements
and show that our method achieves 68.18% accuracy at this challenging task
while significantly outperforming several strong baselines. Finally, we present
extensive qualitative results illustrating our method's predictions and the
visual evidence our method relied on. Our code and annotated dataset can be
found here: https://github.com/SkrighYZ/FGVE.
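The abstract describes weak supervision via multi-instance learning: entailment predictions for individual knowledge elements are aggregated into a sentence-level prediction so that only sample-level labels are needed, with structural constraints keeping the fine-grained predictions consistent. The sketch below is a hypothetical PyTorch illustration of that aggregation-plus-consistency idea; the max-pooling aggregator, the three-way label set, and the specific consistency penalty are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of multi-instance training for fine-grained visual
# entailment: element-level logits are pooled into a sample-level prediction,
# so only sentence-level (sample-level) labels are needed.
import torch
import torch.nn.functional as F

def mil_loss(element_logits: torch.Tensor, sample_label: torch.Tensor,
             consistency_weight: float = 0.1) -> torch.Tensor:
    """element_logits: (num_elements, 3) logits over
    {entailment, neutral, contradiction}, one row per knowledge element.
    sample_label: scalar long tensor holding the sentence-level label.
    """
    # MIL aggregation: take the sentence-level logits as the per-class max
    # over elements (one of several possible aggregators).
    sample_logits = element_logits.max(dim=0).values              # (3,)
    cls_loss = F.cross_entropy(sample_logits.unsqueeze(0),
                               sample_label.unsqueeze(0))

    # Toy structural-consistency penalty (an assumption, not the paper's
    # constraint): the full sentence should not be predicted as entailed
    # more confidently than any of its constituent knowledge elements.
    element_entail = element_logits.softmax(dim=-1)[:, 0]         # P(entail) per element
    sample_entail = sample_logits.softmax(dim=-1)[0]
    consistency = F.relu(sample_entail - element_entail).mean()

    return cls_loss + consistency_weight * consistency
```

A cross-modal model that scores each knowledge element against the image would supply `element_logits`; at test time those element-level logits are themselves the fine-grained, explainable outputs the abstract refers to.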
Related papers
- Probabilistic Prompt Learning for Dense Prediction [45.577125507777474]
We present a novel probabilistic prompt learning method to fully exploit vision-language knowledge in dense prediction tasks.
We introduce learnable class-agnostic attribute prompts to describe universal attributes across object classes.
The attributes are combined with class information and visual-context knowledge to define the class-specific textual distribution.
arXiv Detail & Related papers (2023-04-03T08:01:27Z)
- SMiLE: Schema-augmented Multi-level Contrastive Learning for Knowledge Graph Link Prediction [28.87290783250351]
Link prediction is the task of inferring missing links between entities in knowledge graphs.
We propose a novel Schema-augmented Multi-level contrastive LEarning framework (SMiLE) to conduct knowledge graph link prediction.
arXiv Detail & Related papers (2022-10-10T17:40:19Z)
- New Intent Discovery with Pre-training and Contrastive Learning [21.25371293641141]
New intent discovery aims to uncover novel intent categories from user utterances to expand the set of supported intent classes.
Existing approaches typically rely on a large number of labeled utterances.
We propose a new contrastive loss to exploit self-supervisory signals in unlabeled data for clustering.
arXiv Detail & Related papers (2022-05-25T17:07:25Z)
- Budget-aware Few-shot Learning via Graph Convolutional Network [56.41899553037247]
This paper tackles the problem of few-shot learning, which aims to learn new visual concepts from a few examples.
A common problem setting in few-shot classification assumes a random sampling strategy for acquiring data labels.
We introduce a new budget-aware few-shot learning problem that aims to learn novel object categories under a limited labeling budget.
arXiv Detail & Related papers (2022-01-07T02:46:35Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones (a minimal sketch of the pixel-text matching idea appears after this list).
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- Predicting What You Already Know Helps: Provable Self-Supervised Learning [60.27658820909876]
Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) without requiring labeled data.
We show a mechanism that exploits the statistical connections between certain reconstruction-based pretext tasks and guarantees learning a good representation.
We prove that the linear layer yields a small approximation error even for complex ground-truth function classes.
arXiv Detail & Related papers (2020-08-03T17:56:13Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
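As referenced in the DenseCLIP entry above, the pixel-text matching it describes can be sketched compactly: dense visual features are compared against text embeddings of class prompts to produce one score map per class, which can then guide a dense prediction head. The snippet below is a minimal, hypothetical illustration of that scoring step in PyTorch; the tensor shapes, cosine-similarity scoring, and temperature value are assumptions, not the released DenseCLIP implementation.

```python
# Hypothetical sketch of pixel-text matching as described in the DenseCLIP
# entry: per-pixel image features are scored against text embeddings of
# class prompts, yielding one score map per class.
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """pixel_feats: (B, C, H, W) dense visual features from the image encoder.
    text_feats:  (K, C) text embeddings, one per class prompt.
    Returns score maps of shape (B, K, H, W).
    """
    # Cosine similarity between every pixel embedding and every class prompt.
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=-1)
    score_maps = torch.einsum("bchw,kc->bkhw", pixel_feats, text_feats)
    return score_maps / temperature
```

The resulting score maps could be upsampled and used as an auxiliary per-pixel target or fed to the dense prediction model, in the spirit of the entry's "guide the learning of dense prediction models."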
This list is automatically generated from the titles and abstracts of the papers on this site.