COFAR: Commonsense and Factual Reasoning in Image Search
- URL: http://arxiv.org/abs/2210.08554v1
- Date: Sun, 16 Oct 2022 14:43:13 GMT
- Title: COFAR: Commonsense and Factual Reasoning in Image Search
- Authors: Prajwal Gatti, Abhirama Subramanyam Penamakuri, Revant Teotia, Anand
Mishra, Shubhashis Sengupta, Roshni Ramnani
- Abstract summary: One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent.
We present a unified framework, namely Knowledge Retrieval-Augmented Multimodal Transformer (KRAMT), that treats the named visual entities in an image as a gateway to encyclopedic knowledge.
This unified framework is then used to perform image search requiring commonsense and factual reasoning.
- Score: 2.6354148238224697
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: One characteristic that makes humans superior to modern artificially
intelligent models is the ability to interpret images beyond what is visually
apparent. Consider the following two natural language search queries - (i) "a
queue of customers patiently waiting to buy ice cream" and (ii) "a queue of
tourists going to see a famous Mughal architecture in India." Interpreting
these queries requires one to reason with (i) commonsense, such as interpreting
the people as customers or tourists and the actions as waiting to buy or going
to see; and (ii) factual or world knowledge associated with named visual
entities, for example, whether the store in the image sells ice cream or whether
the landmark in the image is Mughal architecture located in India. Such reasoning goes
beyond just visual recognition. To enable both commonsense and factual
reasoning in image search, we present a unified framework, namely Knowledge
Retrieval-Augmented Multimodal Transformer (KRAMT), that treats the named
visual entities in an image as a gateway to encyclopedic knowledge and
leverages them along with the natural language query to ground relevant knowledge.
Further, KRAMT seamlessly integrates visual content and grounded knowledge to
learn alignment between images and search queries. This unified framework is
then used to perform image search requiring commonsense and factual reasoning.
The retrieval performance of KRAMT is evaluated and compared with related
approaches on a new dataset we introduce - namely COFAR. We make our code and
dataset available at https://vl2g.github.io/projects/cofar
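To make the described flow concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a KRAMT-style scorer: pre-extracted image region features, the tokenized query, and tokenized encyclopedic text retrieved for the image's named visual entities are fused by a single transformer encoder that predicts a query-image alignment score. All module names, dimensions, and the assumption that knowledge retrieval happens outside the model are illustrative.
```python
# Hypothetical sketch of a knowledge retrieval-augmented alignment scorer.
# Not the authors' implementation; shapes and modules are assumptions.
import torch
import torch.nn as nn

class KnowledgeAugmentedScorer(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2,
                 region_feat_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)      # query + knowledge tokens
        self.region_proj = nn.Linear(region_feat_dim, d_model)  # image region features
        self.segment_emb = nn.Embedding(3, d_model)              # 0=query, 1=knowledge, 2=image
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score_head = nn.Linear(d_model, 1)                  # alignment logit

    def forward(self, query_ids, knowledge_ids, region_feats):
        # Embed each stream and tag it with a segment embedding.
        q = self.token_emb(query_ids) + self.segment_emb(torch.zeros_like(query_ids))
        k = self.token_emb(knowledge_ids) + self.segment_emb(torch.ones_like(knowledge_ids))
        v = self.region_proj(region_feats) + self.segment_emb(
            torch.full(region_feats.shape[:2], 2, dtype=torch.long,
                       device=region_feats.device))
        # Joint self-attention over query, grounded knowledge, and image regions.
        fused = self.encoder(torch.cat([q, k, v], dim=1))
        return self.score_head(fused[:, 0]).squeeze(-1)          # score from first query token

if __name__ == "__main__":
    model = KnowledgeAugmentedScorer()
    query = torch.randint(0, 30522, (1, 12))      # e.g. "a queue of tourists ..."
    knowledge = torch.randint(0, 30522, (1, 64))  # retrieved encyclopedic text
    regions = torch.randn(1, 36, 2048)            # detector region features
    print(model(query, knowledge, regions))       # higher = better query-image match
```
At search time, such a score would be computed per candidate image to rank results; the actual KRAMT architecture, retrieval module, and training objective are detailed in the paper.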
Related papers
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of the translated images to assess cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions [64.89284104414865]
We introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions.
MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations.
MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks.
arXiv Detail & Related papers (2024-03-28T17:59:20Z)
- Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images [63.629345688220496]
We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.
The dataset comprises purposefully commonsense-defying images created by designers.
Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!
arXiv Detail & Related papers (2023-03-13T16:49:43Z)
- The Curious Layperson: Fine-Grained Image Recognition without Expert Labels [90.88501867321573]
We consider a new problem: fine-grained image recognition without expert annotations.
We learn a model to describe the visual appearance of objects using non-expert image descriptions.
We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis.
arXiv Detail & Related papers (2021-11-05T17:58:37Z)
- Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models [41.7254780975984]
We extend the task of composed image retrieval, where an input query consists of an image and a short textual description of how to modify the image.
We propose CIRPLANT, a transformer based model that leverages rich pre-trained vision-and-language (V&L) knowledge for modifying visual features conditioned on natural language.
We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on the existing narrow datasets, such as fashion.
arXiv Detail & Related papers (2021-08-09T13:25:06Z)
- Image Translation via Fine-grained Knowledge Transfer [36.898373109689814]
We propose an interpretable knowledge-based image-translation framework that realizes image translation through knowledge retrieval and transfer.
In detail, the framework constructs a plug-and-play, model-agnostic, general-purpose knowledge library that remembers task-specific styles, tones, texture patterns, etc.
arXiv Detail & Related papers (2020-12-21T09:18:48Z)
- TextMage: The Automated Bangla Caption Generator Based On Deep Learning [1.2330326247154968]
TextMage is a system capable of understanding visual scenes that belong to the Bangladeshi geographical context.
The accompanying dataset contains 9,154 images, with two annotations for each image.
arXiv Detail & Related papers (2020-10-15T23:24:15Z)
- Beyond Language: Learning Commonsense from Images for Reasoning [78.33934895163736]
This paper proposes a novel approach to learn commonsense from images, instead of limited raw texts or costly constructed knowledge bases.
Our motivation comes from the fact that an image is worth a thousand words: richer scene information can be leveraged to help distill commonsense knowledge.
arXiv Detail & Related papers (2020-10-10T13:47:13Z)
- Adaptive Semantic-Visual Tree for Hierarchical Embeddings [67.01307058209709]
We propose a hierarchical adaptive semantic-visual tree to depict the architecture of merchandise categories.
The tree evaluates semantic similarities between different semantic levels and visual similarities within the same semantic class simultaneously.
At each level, we set different margins based on the semantic hierarchy and incorporate them as prior information to learn a fine-grained feature embedding.
arXiv Detail & Related papers (2020-03-08T03:36:42Z)
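The last entry above (Adaptive Semantic-Visual Tree) sets margins that vary with depth in the category hierarchy. Below is a minimal, hypothetical sketch of that idea, assuming a standard triplet formulation with cosine distance; the function name, margin schedule, and values are illustrative and not taken from the paper.
```python
# Hypothetical sketch of level-dependent margins in a triplet loss: negatives
# that share a deeper (more specific) tree ancestor with the anchor get a
# smaller margin, while negatives that only meet near the root get a larger one.
import torch
import torch.nn.functional as F

def hierarchical_triplet_loss(anchor, positive, negative, shared_level,
                              base_margin=0.2, step=0.2, max_level=3):
    """shared_level: depth of the deepest tree node shared by anchor and
    negative (0 = only the root). Margins shrink as shared_level grows."""
    margin = base_margin + step * (max_level - shared_level).float()
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)  # distance to positive
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)  # distance to negative
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random embeddings; shared_level would come from the category tree.
emb = lambda: F.normalize(torch.randn(4, 128), dim=-1)
loss = hierarchical_triplet_loss(emb(), emb(), emb(),
                                 shared_level=torch.tensor([0, 1, 2, 3]))
print(loss)
```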