Detection and Captioning with Unseen Object Classes
- URL: http://arxiv.org/abs/2108.06165v1
- Date: Fri, 13 Aug 2021 10:43:20 GMT
- Title: Detection and Captioning with Unseen Object Classes
- Authors: Berkan Demirel and Ramazan Gokberk Cinbis
- Abstract summary: Test images may contain visual objects with no corresponding visual or textual training examples.
We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model.
Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
- Score: 12.894104422808242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image caption generation is one of the most challenging problems at the
intersection of visual recognition and natural language modeling domains. In
this work, we propose and study a practically important variant of this problem
where test images may contain visual objects with no corresponding visual or
textual training examples. For this problem, we propose a detection-driven
approach based on a generalized zero-shot detection model and a template-based
sentence generation model. In order to improve the detection component, we
jointly define a class-to-class similarity based class representation and a
practical score calibration mechanism. We also propose a novel evaluation
metric that provides complementary insights into the captioning outputs by
separately handling the visual and non-visual components of the captions. Our
experiments show that the proposed zero-shot detection model obtains
state-of-the-art performance on the MS-COCO dataset and the zero-shot
captioning approach yields promising results.
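As a rough illustration of the two detection-side ideas named in the abstract (a class-to-class similarity based class representation and score calibration) together with template-based sentence generation, here is a minimal NumPy sketch. The function names, the softmax weighting, and the fixed margin are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def zero_shot_region_scores(region_feat, W_seen, E_seen, E_unseen, gamma=0.2):
    """Illustrative only: score one region proposal over seen + unseen classes.

    region_feat : (d,)   visual feature of the region proposal
    W_seen      : (S, d) classifiers learned for the S seen classes
    E_seen      : (S, k) semantic embeddings (e.g., word vectors) of seen classes
    E_unseen    : (U, k) semantic embeddings of unseen classes
    gamma       : calibration margin subtracted from seen-class scores so that
                  unseen classes are not systematically dominated at test time
    """
    seen_scores = W_seen @ region_feat                          # (S,)

    # Class-to-class similarity: represent each unseen class as a softmax-weighted
    # combination of seen classes in the semantic embedding space.
    sim = E_unseen @ E_seen.T                                   # (U, S)
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    sim /= sim.sum(axis=1, keepdims=True)
    unseen_scores = sim @ seen_scores                           # (U,)

    # Score calibration: penalize seen classes by a fixed margin.
    return np.concatenate([seen_scores - gamma, unseen_scores])

def template_caption(labels):
    # Template-based sentence generation: slot detected class names into a pattern.
    if not labels:
        return "A photo of a scene."
    return "A photo of " + " and ".join(f"a {c}" for c in labels) + "."
```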
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
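A minimal sketch of such a visual-similarity retriever, assuming precomputed image embeddings and a caption memory; the names and the cosine-similarity choice are illustrative, not the paper's code.

```python
import numpy as np

def retrieve_memory_captions(query_embed, memory_embeds, memory_captions, k=5):
    """Illustrative kNN retriever: return captions of the k memory images most
    visually similar (cosine similarity) to the query image embedding."""
    q = query_embed / np.linalg.norm(query_embed)
    m = memory_embeds / np.linalg.norm(memory_embeds, axis=1, keepdims=True)
    sims = m @ q                          # (N,) similarity to every memory image
    top = np.argsort(-sims)[:k]
    # The retrieved captions would then be fed to the decoder as extra context.
    return [memory_captions[i] for i in top]
```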
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- A Unified Interactive Model Evaluation for Classification, Object Detection, and Instance Segmentation in Computer Vision [31.441561710096877]
We develop an open-source visual analysis tool, Uni-Evaluator, to support a unified model evaluation for classification, object detection, and instance segmentation in computer vision.
The key idea behind our method is to formulate both discrete and continuous predictions in different tasks as unified probability distributions.
Based on these distributions, we develop 1) a matrix-based visualization to provide an overview of model performance; 2) a table visualization to identify the problematic data subsets where the model performs poorly; and 3) a grid visualization to display the samples of interest.
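One hedged reading of "unified probability distributions" (an interpretation for illustration only, not Uni-Evaluator's code) is to normalize any per-class score vector, whether from classification, detection, or segmentation, into a distribution before analysis:

```python
import numpy as np

def to_class_distribution(scores):
    """Illustrative: map a non-negative per-class score vector (classification
    probabilities, detection confidences, mask scores, ...) to a probability
    distribution so different tasks can share one evaluation view."""
    scores = np.clip(np.asarray(scores, dtype=float), 0.0, None)
    total = scores.sum()
    if total == 0.0:
        return np.full_like(scores, 1.0 / len(scores))   # uninformative prediction
    return scores / total
```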
arXiv Detail & Related papers (2023-08-09T18:11:28Z)
- Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluating deep learning models is to build a labeled test set with attributes of interest and assess how well the model performs on it.
This paper argues that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set or labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z)
- Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
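A minimal PyTorch sketch of the hypernetwork idea, with illustrative dimensions and names; the actual architecture and training objective are those described in the paper.

```python
import torch
import torch.nn as nn

class TextToClassifier(nn.Module):
    """Illustrative hypernetwork: map per-class text embeddings to the weights
    of a linear classifier, yielding a zero-shot multi-class model."""
    def __init__(self, text_dim=512, feat_dim=768, hidden=1024):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, class_text_embeds, image_feats):
        # class_text_embeds: (C, text_dim), image_feats: (B, feat_dim)
        W = self.hyper(class_text_embeds)   # generated classifier weights, (C, feat_dim)
        return image_feats @ W.t()          # class logits, (B, C)
```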
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We explore Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and the abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
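A minimal zero-shot scoring sketch in this spirit, assuming the open-source openai/CLIP package and an illustrative antonym-prompt pair; the exact prompts and scoring used by the authors may differ.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
# Antonym prompts for a "look" (quality) judgment; wording is purely illustrative.
prompts = clip.tokenize(["Good photo.", "Bad photo."]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

quality_score = probs[0, 0].item()   # mass on the positive prompt, in [0, 1]
```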
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- A Baseline for Detecting Out-of-Distribution Examples in Image Captioning [12.953517767147998]
We consider the problem of OOD detection in image captioning.
We show the effectiveness of the caption's likelihood score at detecting and rejecting OOD images.
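In practice this amounts to thresholding a (length-normalized) caption log-likelihood; a hedged sketch follows, with an illustrative threshold that would need tuning on validation data.

```python
import numpy as np

def flag_ood_by_caption_likelihood(token_logprobs, threshold=-3.5):
    """Illustrative: mean per-token log-probability of the generated caption
    under the captioning model; low values suggest the image is OOD."""
    score = float(np.mean(token_logprobs))
    return score < threshold, score

# Example: a fluent caption vs. an uncertain one
print(flag_ood_by_caption_likelihood([-0.4, -0.9, -0.3, -1.1]))   # kept
print(flag_ood_by_caption_likelihood([-4.2, -5.0, -3.8, -6.1]))   # flagged as OOD
```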
arXiv Detail & Related papers (2022-07-12T09:29:57Z)
- Robust Region Feature Synthesizer for Zero-Shot Object Detection [87.79902339984142]
We build a novel zero-shot object detection framework that contains an Intra-class Semantic Diverging component and an Inter-class Structure Preserving component.
It is the first study to carry out zero-shot object detection in remote sensing imagery.
arXiv Detail & Related papers (2022-01-01T03:09:15Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
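A hedged sketch of the general feature-synthesis recipe (conditional generation from class semantics plus noise); the dimensions, names, and the omitted adversarial/discriminative losses are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class ConditionalFeatureGenerator(nn.Module):
    """Illustrative: synthesize visual features for a class from its semantic
    embedding plus noise, so unseen classes can be trained on in feature space."""
    def __init__(self, sem_dim=300, noise_dim=128, feat_dim=1024, hidden=2048):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim),
            nn.ReLU(),
        )

    def forward(self, class_embed, n_samples):
        # class_embed: (sem_dim,) semantic vector of one (possibly unseen) class
        z = torch.randn(n_samples, self.noise_dim, device=class_embed.device)
        cond = class_embed.unsqueeze(0).expand(n_samples, -1)
        return self.net(torch.cat([cond, z], dim=1))   # (n_samples, feat_dim)
```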
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
- Image Captioning with Compositional Neural Module Networks [18.27510863075184]
We introduce a hierarchical framework for image captioning that explores both compositionality and sequentiality of natural language.
Our algorithm learns to compose a detail-rich sentence by selectively attending to different modules corresponding to unique aspects of each object detected in an input image.
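A rough sketch of the selective-attention-over-modules idea; the module definitions, decoder, and dimensions here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ModuleAttention(nn.Module):
    """Illustrative: at each decoding step, softly attend over per-object
    module outputs and feed the blended context to the sentence decoder."""
    def __init__(self, module_dim=512, state_dim=512):
        super().__init__()
        self.query = nn.Linear(state_dim, module_dim)

    def forward(self, decoder_state, module_outputs):
        # decoder_state: (B, state_dim); module_outputs: (B, M, module_dim)
        q = self.query(decoder_state).unsqueeze(2)          # (B, module_dim, 1)
        attn = torch.softmax(module_outputs @ q, dim=1)     # (B, M, 1)
        context = (attn * module_outputs).sum(dim=1)        # (B, module_dim)
        return context, attn.squeeze(-1)
```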
arXiv Detail & Related papers (2020-07-10T20:58:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.