QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning
- URL: http://arxiv.org/abs/2302.00952v3
- Date: Wed, 28 Jun 2023 09:41:25 GMT
- Title: QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning
- Authors: Weimin Shi, Mingchen Zhuge, Dehong Gao, Zhong Zhou, Ming-Ming Cheng, Deng-Ping Fan
- Abstract summary: We teach machines to predict where and when images were taken rather than performing basic tasks like segmentation or classification.
Experiments show QR-CLIP's effectiveness: it outperforms the previous SOTA by about 10% and 130% relative lift on location and time reasoning, respectively.
This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is key to these tasks.
- Score: 84.20305293037683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Everyday images can convey abstract meanings that require us to memorize and infer deeper information from them. To encourage such human-like reasoning, in this work we teach machines to predict where and when an image was taken, rather than performing basic tasks like traditional segmentation or classification. Inspired by Horn's QR theory, we designed a novel QR-CLIP model consisting of two components: 1) the Quantity module first gathers additional open-world knowledge as candidate language inputs; 2) the Relevance module then weighs the vision and language cues and infers the location and time. Experiments show QR-CLIP's effectiveness: it outperforms the previous SOTA by about 10% and 130% relative lift on location and time reasoning, respectively. This study lays a technical foundation for location and time reasoning and suggests that effectively introducing open-world knowledge is key to these tasks.
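The abstract describes a retrieve-then-weigh design: the Quantity module gathers candidate open-world knowledge, and the Relevance module weighs those candidates against the visual evidence before predicting a location and time. Purely as an illustration of that pattern (the paper's actual architecture, encoders, and training objective are not reproduced here), the sketch below uses random unit vectors as stand-ins for CLIP embeddings; every function name, knowledge string, and label in it is hypothetical.

```python
"""Hypothetical sketch of a QR-CLIP-style retrieve-then-weigh pipeline.
The Quantity step gathers candidate open-world knowledge sentences; the
Relevance step weighs them against the image before predicting a label.
Random unit vectors stand in for CLIP encoders."""
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # CLIP-like embedding width


def embed(_inputs, n=1):
    """Placeholder encoder: ignores its input and returns unit-norm random vectors."""
    v = rng.normal(size=(n, DIM))
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# --- Quantity: gather candidate open-world knowledge for the image ----------
knowledge_pool = [
    "The Berlin Wall fell in November 1989.",
    "Times Square hosts a ball drop every New Year's Eve.",
    "The 2008 Olympics opening ceremony took place in Beijing.",
]
image_emb = embed("query image")                          # (1, DIM)
knowledge_emb = embed(knowledge_pool, n=len(knowledge_pool))

sims = knowledge_emb @ image_emb.T                        # dot product = cosine (unit norm)
top_k = np.argsort(-sims[:, 0])[:2]                       # keep the k most related facts

# --- Relevance: weigh the retained facts and fuse them with the visual cue --
weights = np.exp(sims[top_k, 0]) / np.exp(sims[top_k, 0]).sum()   # softmax weights
fused = image_emb[0] + (weights[:, None] * knowledge_emb[top_k]).sum(axis=0)
fused /= np.linalg.norm(fused)

# --- Predict location/time as the nearest label embedding -------------------
labels = ["Berlin, 1989", "New York, 2000", "Beijing, 2008"]
label_emb = embed(labels, n=len(labels))
print("predicted location/time:", labels[int(np.argmax(label_emb @ fused))])
```

With trained encoders, the same fusion would let a strongly related piece of retrieved knowledge dominate the prediction, which is the intuition behind weighting candidates by relevance rather than concatenating them blindly.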
Related papers
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations, with instance-level annotations for the situations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, followed by manual review for quality assurance.
arXiv Detail & Related papers (2024-05-15T21:55:31Z)
- Asking for Knowledge: Training RL Agents to Query External Knowledge Using Language [121.56329458876655]
We introduce two new environments: the grid-world-based Q-BabyAI and the text-based Q-TextWorld.
We propose the "Asking for Knowledge" (AFK) agent, which learns to generate language commands to query for meaningful knowledge.
arXiv Detail & Related papers (2022-05-12T14:20:31Z)
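The AFK entry above describes an agent that can spend steps on natural-language queries for external knowledge. As a toy illustration of that interaction pattern only (not the paper's Q-BabyAI or Q-TextWorld environments, and not its learned policy), a query-or-act loop might look like the following sketch, in which the environment, knowledge base, and scripted actions are all invented.

```python
"""Toy query-or-act loop: the agent may either query an external knowledge
source or act in the world (illustrative only; not the AFK agent itself)."""
from dataclasses import dataclass, field

KNOWLEDGE_BASE = {  # hypothetical external knowledge the agent can query
    "where is the key": "the key is in the blue room",
}


@dataclass
class ToyEnv:
    observation: str = "you are in a hallway"
    facts: list = field(default_factory=list)

    def step(self, action: str):
        if action.startswith("query:"):
            # Query actions hit the knowledge source instead of the world.
            answer = KNOWLEDGE_BASE.get(action.removeprefix("query:").strip(),
                                        "no relevant knowledge found")
            self.facts.append(answer)
            reward = 0.0            # asking costs a step but gives no reward itself
        else:
            reward = 1.0 if "blue room" in action and self.facts else 0.0
        obs = self.observation + " | known: " + "; ".join(self.facts)
        return obs, reward


env = ToyEnv()
print(env.step("query: where is the key"))   # ask before acting
print(env.step("go to the blue room"))       # then exploit the returned fact
```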
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We achieve results comparable to the state of the art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)
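The entry above describes an extract-filter-encode pipeline for external knowledge. The sketch below illustrates only the generic filter step, with placeholder embeddings and an arbitrary threshold; it is an assumption-laden stand-in, not the paper's retrieval sources or encoder.

```python
"""Generic sketch of the extract-filter-encode pattern for external knowledge
(placeholder embeddings; not the paper's specific retrieval or encoder)."""
import numpy as np

rng = np.random.default_rng(1)


def embed(text: str) -> np.ndarray:
    """Stand-in sentence encoder returning a unit-norm random vector."""
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)


question = "What brand is written on the sign?"
retrieved_facts = [            # extract: candidate facts pulled from an external KB
    "Coca-Cola is a soft-drink brand founded in 1886.",
    "Stop signs are octagonal and red.",
    "The Eiffel Tower is in Paris.",
]

q_vec = embed(question)
kept = []
for fact in retrieved_facts:   # filter: drop facts weakly related to the question
    score = float(embed(fact) @ q_vec)
    if score > 0.0:            # illustrative threshold, not a tuned value
        kept.append((score, fact))

# encode: the surviving facts would be tokenized and appended to the multimodal
# transformer's input sequence alongside the image and OCR tokens.
print([fact for _, fact in sorted(kept, reverse=True)])
```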
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
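The entry above stresses that the knowledge-injection technique is model-agnostic and adds little overhead. One common way to realize that property, shown here purely as a hedged sketch rather than the paper's actual method, is to pool embeddings of retrieved KB facts and fuse them with whatever pooled output the vision-and-language transformer already produces.

```python
"""Hypothetical late-fusion wrapper: pooled knowledge embeddings are concatenated
with a vision-and-language model's pooled output before classification.
The tensors below are random stand-ins, not outputs of any specific model."""
import torch
import torch.nn as nn


class KnowledgeFusionHead(nn.Module):
    def __init__(self, hidden: int = 768, kb_dim: int = 300, num_answers: int = 10):
        super().__init__()
        self.kb_proj = nn.Linear(kb_dim, hidden)        # project KB fact embeddings
        self.classifier = nn.Linear(2 * hidden, num_answers)

    def forward(self, pooled: torch.Tensor, kb_facts: torch.Tensor) -> torch.Tensor:
        # kb_facts: (batch, n_facts, kb_dim) -> mean-pool, project, concatenate.
        kb = self.kb_proj(kb_facts.mean(dim=1))
        return self.classifier(torch.cat([pooled, kb], dim=-1))


head = KnowledgeFusionHead()
pooled_output = torch.randn(4, 768)       # would come from any V&L transformer
fact_embeddings = torch.randn(4, 5, 300)  # e.g. embeddings of retrieved KB facts
print(head(pooled_output, fact_embeddings).shape)   # torch.Size([4, 10])
```

Because the fusion only touches the pooled output, the same head can sit on top of different backbones, which is one plausible reading of "model-agnostic with minimal computational overhead".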
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
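KILT's reported strong baseline couples a shared dense vector index with a seq2seq model. The sketch below only mimics that retrieve-then-generate shape: random vectors stand in for a trained dense retriever and a format string stands in for the seq2seq generator, so none of it reflects KILT's actual components.

```python
"""Minimal retrieve-then-generate sketch in the spirit of a dense-index +
seq2seq baseline (random vectors and a placeholder generator)."""
import numpy as np

rng = np.random.default_rng(2)
DIM = 128

# Shared dense index: one vector per Wikipedia-like passage.
passages = [
    "Marie Curie won Nobel Prizes in Physics and Chemistry.",
    "The Great Barrier Reef lies off the coast of Queensland.",
    "Alan Turing proposed the Turing test in 1950.",
]
index = rng.normal(size=(len(passages), DIM))
index /= np.linalg.norm(index, axis=1, keepdims=True)


def retrieve(query: str, k: int = 2) -> list[str]:
    """Stand-in dense retriever: encodes the query randomly, then max inner product."""
    q = rng.normal(size=DIM)
    q /= np.linalg.norm(q)
    top = np.argsort(-(index @ q))[:k]
    return [passages[i] for i in top]


def generate(query: str, evidence: list[str]) -> str:
    """Placeholder for a seq2seq model conditioned on the query plus retrieved text."""
    return f"answer({query!r}) grounded in: {evidence[0]}"


query = "Who proposed the Turing test?"
print(generate(query, retrieve(query)))
```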
- Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers [54.417299589288184]
We investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus.
Our adapter-based models substantially outperform BERT on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS.
arXiv Detail & Related papers (2020-05-24T15:49:57Z)
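The adapter-based models above insert small trainable layers into an otherwise frozen pretrained transformer. The block below is a generic bottleneck adapter of that kind; the paper's exact placement, sizes, and its ConceptNet/OMCS training setup are not reproduced, so treat the numbers as illustrative defaults.

```python
"""Generic bottleneck adapter block of the kind used for knowledge injection:
a down-projection, nonlinearity, up-projection, and residual connection."""
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, hidden: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen transformer's representation intact;
        # only the small adapter weights are trained on the knowledge corpus.
        return x + self.up(self.act(self.down(x)))


adapter = Adapter()
hidden_states = torch.randn(2, 16, 768)   # (batch, sequence, hidden) from a BERT layer
print(adapter(hidden_states).shape)       # torch.Size([2, 16, 768])
```

Training only these few extra parameters is what lets the conceptual knowledge be added without overwriting BERT's distributional knowledge.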
- REALM: Retrieval-Augmented Language Model Pre-Training [37.3178586179607]
We augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia.
For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner.
We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA).
arXiv Detail & Related papers (2020-02-10T18:40:59Z)
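REALM's central quantity is the answer likelihood marginalized over retrieved documents, p(y|x) = sum over z of p(z|x) * p(y|x, z), which is what lets gradients reach the latent retriever. The snippet below only evaluates that mixture with made-up probabilities to show the shape of the objective; it contains none of REALM's actual retriever or encoder.

```python
"""Shape of REALM's retrieve-and-attend objective: marginalize the answer
likelihood over retrieved documents, p(y|x) = sum_z p(z|x) * p(y|x, z).
The probabilities below are made-up placeholders, not model outputs."""
import numpy as np

# Retrieval distribution over three candidate documents given the query x,
# e.g. a softmax over inner products between query and document embeddings.
p_z_given_x = np.array([0.6, 0.3, 0.1])

# Likelihood of the correct answer y given the query and each document.
p_y_given_xz = np.array([0.8, 0.1, 0.05])

p_y_given_x = float(p_z_given_x @ p_y_given_xz)
loss = -np.log(p_y_given_x)   # minimizing this trains retriever and encoder jointly
print(f"p(y|x) = {p_y_given_x:.3f}, NLL = {loss:.3f}")
```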