Visually Grounded Commonsense Knowledge Acquisition
- URL: http://arxiv.org/abs/2211.12054v2
- Date: Sat, 25 Mar 2023 07:16:48 GMT
- Title: Visually Grounded Commonsense Knowledge Acquisition
- Authors: Yuan Yao, Tianyu Yu, Ao Zhang, Mengdi Li, Ruobing Xie, Cornelius
Weber, Zhiyuan Liu, Hai-Tao Zheng, Stefan Wermter, Tat-Seng Chua, Maosong Sun
- Abstract summary: Large-scale commonsense knowledge bases empower a broad range of AI applications.
Visual perception contains rich commonsense knowledge about real-world entities.
We present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem.
- Score: 132.42003872906062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale commonsense knowledge bases empower a broad range of AI
applications, where the automatic extraction of commonsense knowledge (CKE) is
a fundamental and challenging problem. CKE from text is known to suffer
from the inherent sparsity and reporting bias of commonsense in text. Visual
perception, on the other hand, contains rich commonsense knowledge about
real-world entities, e.g., (person, can_hold, bottle), which can serve as
promising sources for acquiring grounded commonsense knowledge. In this work,
we present CLEVER, which formulates CKE as a distantly supervised
multi-instance learning problem, where models learn to summarize commonsense
relations from a bag of images about an entity pair without any human
annotation on image instances. To address the problem, CLEVER leverages
vision-language pre-training models for deep understanding of each image in the
bag, and selects informative instances from the bag to summarize commonsense
entity relations via a novel contrastive attention mechanism. Comprehensive
experimental results in held-out and human evaluation show that CLEVER can
extract commonsense knowledge of promising quality, outperforming pre-trained
language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted
commonsense scores show strong correlation with human judgment with a 0.78
Spearman coefficient. Moreover, the extracted commonsense can also be grounded
into images with reasonable interpretability. The data and code can be
obtained at https://github.com/thunlp/CLEVER.
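As a rough illustration of the bag-level formulation described in the abstract, the sketch below aggregates a bag of image features for one entity pair with a relation-specific attention mechanism and scores candidate commonsense relations. This is a minimal sketch under assumed names and shapes (BagAggregator, a 768-dimensional feature size, the toy bag), not the authors' released contrastive attention implementation; the actual code is at https://github.com/thunlp/CLEVER.
```python
# Hypothetical sketch of attention-based multi-instance aggregation over a bag
# of image features for one entity pair, e.g. (person, bottle). Not the paper's
# exact contrastive attention; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class BagAggregator(nn.Module):
    """Scores commonsense relations for an entity pair from a bag of image features."""

    def __init__(self, feat_dim: int, num_relations: int):
        super().__init__()
        # One query vector per relation: instances are weighted differently
        # depending on which relation is being scored (attention-style selection).
        self.relation_queries = nn.Parameter(torch.randn(num_relations, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_relations)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_images, feat_dim) -- per-image features of the entity pair,
        # e.g. produced by a vision-language pre-training model.
        # Attention weights: how informative each image is for each relation.
        attn = torch.softmax(bag @ self.relation_queries.t(), dim=0)  # (num_images, num_relations)
        # Relation-specific bag representations: weighted sums over instances.
        bag_repr = attn.t() @ bag                                     # (num_relations, feat_dim)
        # Score each relation against its own bag representation (diagonal of the logit matrix).
        logits = self.classifier(bag_repr)                            # (num_relations, num_relations)
        return logits.diagonal()                                      # (num_relations,)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = BagAggregator(feat_dim=768, num_relations=8)
    fake_bag = torch.randn(20, 768)  # 20 images of one entity pair, no per-image labels
    scores = model(fake_bag)
    print(scores.shape)              # torch.Size([8]): one score per candidate relation
```
In a distantly supervised setup such as the one described, only bag-level relation labels (from an existing knowledge base) would supervise these scores; no human annotation of individual images is needed.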
Related papers
- What Really is Commonsense Knowledge? [58.5342212738895]
We survey existing definitions of commonsense knowledge, ground them in three frameworks for defining concepts, and consolidate them into a unified definition of commonsense knowledge.
We then use the consolidated definition for annotations and experiments on the CommonsenseQA and CommonsenseQA 2.0 datasets.
Our study shows that there exists a large portion of non-commonsense-knowledge instances in the two datasets, and a large performance gap on these two subsets.
arXiv Detail & Related papers (2024-11-06T14:54:19Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction.
arXiv Detail & Related papers (2023-06-04T15:44:51Z)
- ComFact: A Benchmark for Linking Contextual Commonsense Knowledge [31.19689856957576]
We propose the new task of commonsense fact linking, where models are given contexts and trained to identify situationally-relevant commonsense knowledge from KGs.
Our novel benchmark, ComFact, contains 293k in-context relevance annotations for commonsense across four stylistically diverse datasets.
arXiv Detail & Related papers (2022-10-23T09:30:39Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge with explicit syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
arXiv Detail & Related papers (2022-01-13T08:25:53Z)
- Commonsense Knowledge in Word Associations and ConceptNet [37.751909219863585]
This paper presents an in-depth comparison of two large-scale resources of general knowledge: ConceptNet and SWOW.
We examine the structure, overlap and differences between the two graphs, as well as the extent to which they encode situational commonsense knowledge.
arXiv Detail & Related papers (2021-09-20T06:06:30Z)
- Latent Correlation-Based Multiview Learning and Self-Supervision: A Unifying Perspective [41.80156041871873]
This work puts forth a theory-backed framework for unsupervised multiview learning.
Our development starts with proposing a multiview model, where each view is a nonlinear mixture of shared and private components.
In addition, the private information in each view can be provably disentangled from the shared using proper regularization design.
arXiv Detail & Related papers (2021-06-14T00:12:36Z)
- KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning [4.787501955202053]
In visual commonsense reasoning (VCR) task, a machine must answer correctly and then provide a rationale justifying its answer.
We propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model.
Besides taking visual and linguistic contents as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer.
arXiv Detail & Related papers (2020-12-13T08:22:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.