The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
- URL: http://arxiv.org/abs/2111.03651v1
- Date: Fri, 5 Nov 2021 17:58:37 GMT
- Title: The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
- Authors: Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi
- Abstract summary: We consider a new problem: fine-grained image recognition without expert annotations.
We learn a model to describe the visual appearance of objects using non-expert image descriptions.
We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis.
- Score: 90.88501867321573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most of us are not experts in specific fields, such as ornithology.
Nonetheless, we do have general image and language understanding capabilities
that we use to match what we see to expert resources. This allows us to expand
our knowledge and perform novel tasks without ad-hoc external supervision. In
contrast, machines have a much harder time consulting expert-curated
knowledge bases unless trained specifically with that knowledge in mind. Thus,
in this paper we consider a new problem: fine-grained image recognition without
expert annotations, which we address by leveraging the vast knowledge available
in web encyclopedias. First, we learn a model to describe the visual appearance
of objects using non-expert image descriptions. We then train a fine-grained
textual similarity model that matches image descriptions with documents on a
sentence-level basis. We evaluate the method on two datasets and compare with
several strong baselines and the state of the art in cross-modal retrieval.
Code is available at: https://github.com/subhc/clever
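The sentence-level matching step lends itself to a short illustration. The sketch below is not the authors' CLEVER model: it assumes sentences describing an image have already been generated, swaps in an off-the-shelf encoder from the sentence-transformers library for the paper's fine-grained textual similarity model, and scores each encyclopedia document by matching every description sentence to its most similar document sentence. The model name, the max-then-mean aggregation, and the toy bird corpus are illustrative assumptions.

```python
# Minimal sketch of sentence-level description-to-document matching.
# NOT the authors' model: a generic pretrained sentence encoder stands in
# for the fine-grained textual similarity model trained in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # off-the-shelf encoder (assumption)

def score_document(description_sents, document_sents):
    """Mean over description sentences of the best-matching document sentence."""
    d = encoder.encode(description_sents, normalize_embeddings=True)
    e = encoder.encode(document_sents, normalize_embeddings=True)
    sims = d @ e.T  # cosine similarity matrix, since embeddings are normalized
    return float(sims.max(axis=1).mean())

def classify(description_sents, corpus):
    """Return the corpus entry whose document best matches the description."""
    return max(corpus, key=lambda name: score_document(description_sents, corpus[name]))

# Hypothetical two-document "encyclopedia", for illustration only.
corpus = {
    "Northern Cardinal": [
        "The male is a brilliant red bird.",
        "It has a prominent crest and a black face mask.",
    ],
    "Blue Jay": [
        "The plumage is blue and white.",
        "It has a pronounced crest and a black collar.",
    ],
}
description = ["A bright red bird with a crest.", "The face is black near the bill."]
print(classify(description, corpus))  # expected: Northern Cardinal
```

Aggregating with max-then-mean rewards documents that account for every observed attribute while tolerating the extra, non-visual content typical of encyclopedia pages.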
Related papers
- Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review [0.0]
This paper explores AI-assistive deep learning image annotation systems that provide textual suggestions, captions, or descriptions of the input image to the annotator.
We review various datasets and how they contribute to the training and evaluation of AI-assistive annotation systems.
Despite the promising potential, there is limited publicly available work on AI-assistive image annotation with textual output capabilities.
arXiv Detail & Related papers (2024-06-28T22:56:17Z)
- Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns disentangled concept embeddings for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines implicit contextual knowledge behind scene text image.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification (a minimal fusion sketch follows this entry).
Our method outperforms the state of the art by 3.72% mAP and 5.39% mAP on the two benchmark datasets, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
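As a rough illustration of the fusion step in the entry above, here is a hedged sketch of a late-fusion classifier that concatenates pooled image features with a retrieved knowledge embedding. The dimensions, hidden size, class count, and random inputs are illustrative assumptions standing in for real CNN features and KnowBert outputs; this is not the paper's actual network.

```python
# Hedged sketch: late fusion of image features with a knowledge embedding.
# All sizes are illustrative; random tensors stand in for real features.
import torch
import torch.nn as nn

class KnowledgeFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, know_dim=768, hidden=512, num_classes=28):
        super().__init__()
        # Concatenate both modalities, then classify through a small MLP.
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + know_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, knowledge_emb):
        return self.fuse(torch.cat([img_feat, knowledge_emb], dim=-1))

model = KnowledgeFusionClassifier()
img_feat = torch.randn(4, 2048)        # e.g. pooled ResNet-50 features (assumption)
knowledge_emb = torch.randn(4, 768)    # e.g. a KnowBert-style embedding (assumption)
logits = model(img_feat, knowledge_emb)  # shape: (4, 28)
```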
- One-shot Scene Graph Generation [130.57405850346836]
We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method outperforms existing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-02-22T11:32:59Z)
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We achieve results comparable to the state of the art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)
- Interpretable Visual Understanding with Cognitive Attention Network [20.991018495051623]
We propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning.
We first introduce an image-text fusion module to fuse information from images and text collectively.
Second, a novel inference module is designed to encode commonsense among image, query and response.
arXiv Detail & Related papers (2021-08-06T02:57:43Z)
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- Hierarchical Semantic Segmentation using Psychometric Learning [17.417302703539367]
We develop a novel approach to collect segmentation annotations from experts based on psychometric testing.
Our method consists of the psychometric testing procedure, active query selection, query enhancement, and a deep metric learning model.
We show the merits of our method with evaluations on synthetically generated, aerial, and histology images.
arXiv Detail & Related papers (2021-07-07T13:38:33Z)
- Learning Multimodal Affinities for Textual Editing in Images [18.7418059568887]
We devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image.
We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups (a clustering sketch follows this entry).
We show that our technique can operate on highly varying images spanning a wide range of documents and demonstrate its applicability for various editing operations.
arXiv Detail & Related papers (2021-03-18T10:09:57Z)
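As a rough sketch of the cluster-by-affinity idea in the entry above, the snippet below blends cosine affinities computed from textual and visual embeddings of each entity and hands the combined matrix to off-the-shelf spectral clustering. The feature dimensions, the blending weight, and the random inputs are assumptions; the paper's unsupervised affinity learning itself is not reproduced.

```python
# Hedged sketch: cluster textual entities from a blended multimodal affinity.
# Inputs are hypothetical per-entity embeddings; the learned affinities from
# the paper are replaced by simple cosine similarities.
import numpy as np
from sklearn.cluster import SpectralClustering

def cosine_affinity(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return np.clip(x @ x.T, 0.0, 1.0)  # clip: affinities must be non-negative

def cluster_entities(text_emb, vis_emb, n_groups=3, alpha=0.5):
    """Blend text and visual cosine affinities (weight alpha on text),
    then group entities with spectral clustering on the combined matrix."""
    affinity = alpha * cosine_affinity(text_emb) + (1 - alpha) * cosine_affinity(vis_emb)
    return SpectralClustering(n_clusters=n_groups,
                              affinity="precomputed").fit_predict(affinity)

# Hypothetical features: 10 entities, 64-d text and 32-d visual embeddings.
rng = np.random.default_rng(0)
labels = cluster_entities(rng.normal(size=(10, 64)), rng.normal(size=(10, 32)))
print(labels)  # cluster id per entity
```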
- This is not the Texture you are looking for! Introducing Novel Counterfactual Explanations for Non-Experts using Generative Adversarial Learning [59.17685450892182]
Counterfactual explanation systems try to enable counterfactual reasoning by modifying the input image.
We present a novel approach to generate such counterfactual image explanations based on adversarial image-to-image translation techniques.
Our results show that our approach performs significantly better regarding mental models, explanation satisfaction, trust, emotions, and self-efficacy than two state-of-the-art systems.
arXiv Detail & Related papers (2020-12-22T10:08:05Z)