Captioning Images with Novel Objects via Online Vocabulary Expansion
- URL: http://arxiv.org/abs/2003.03305v1
- Date: Fri, 6 Mar 2020 16:34:15 GMT
- Title: Captioning Images with Novel Objects via Online Vocabulary Expansion
- Authors: Mikihiro Tanaka, Tatsuya Harada
- Abstract summary: We introduce a low-cost method for generating descriptions of images containing novel objects.
We propose a method that can describe images with novel objects, without retraining, by using word embeddings of the objects estimated from only a small number of their image features.
- Score: 62.525165808406626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we introduce a low-cost method for generating
descriptions of images containing novel objects. Generally, constructing a
model that can describe images with novel objects is costly because it
requires (1) collecting a large amount of data for each category and (2)
retraining the entire system. When humans see a small number of novel objects,
they can estimate the objects' properties by associating their appearance with
that of known objects. Accordingly, we propose a method that can describe
images with novel objects without retraining, by using word embeddings of the
objects estimated from only a small number of their image features. The method
can be integrated with general image-captioning models. The experimental
results show the effectiveness of our approach.
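The core idea of the abstract, estimating a novel object's word embedding by associating its appearance with known objects, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's actual implementation: the novel word's embedding is taken as a similarity-weighted average of known objects' embeddings, with similarities computed between a few image features of the novel object and the known objects' image features.

```python
import numpy as np

def estimate_embedding(novel_feats, known_feats, known_embeds, temperature=0.1):
    """Estimate a word embedding for a novel object from a few image features.

    novel_feats:  (k, d_img)  a small number of image features of the novel object
    known_feats:  (n, d_img)  one representative image feature per known object
    known_embeds: (n, d_emb)  word embeddings of the known objects
    Returns a (d_emb,) embedding: a softmax-weighted mix of known embeddings.
    """
    # Average the few observations of the novel object into one query feature.
    q = novel_feats.mean(axis=0)
    q = q / np.linalg.norm(q)
    keys = known_feats / np.linalg.norm(known_feats, axis=1, keepdims=True)
    # Cosine similarity to each known object, sharpened by a softmax.
    sims = keys @ q
    w = np.exp(sims / temperature)
    w = w / w.sum()
    # Known objects that look similar contribute most to the estimate.
    return w @ known_embeds
```

The estimated embedding can then be appended to the vocabulary of a pretrained captioning model, which is the sense in which the method avoids retraining.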
Related papers
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding [7.893308498886083]
Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way.
A prototypical embedding, based on the object's appearance and its class, is constructed before fine-tuning the diffusion model.
Our method outperforms several existing works.
arXiv Detail & Related papers (2024-01-28T17:11:42Z)
- ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning [25.033615513933192]
We introduce ObjectComposer for generating compositions of multiple objects that resemble user-specified images.
Our approach is training-free, leveraging the abilities of preexisting models.
arXiv Detail & Related papers (2023-10-10T19:46:58Z)
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
First, we present a 6D pose refiner based on a render-and-compare strategy that can be applied to novel objects.
Second, we introduce a novel approach to coarse pose estimation that leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
- Automatic dataset generation for specific object detection [6.346581421948067]
We present a method to synthesize object-in-scene images, which can preserve the objects' detailed features without bringing irrelevant information.
Our results show that, in the synthesized images, the boundaries of objects blend well with the background.
arXiv Detail & Related papers (2022-07-16T07:44:33Z)
- ShaRF: Shape-conditioned Radiance Fields from a Single View [54.39347002226309]
We present a method for estimating neural scene representations of objects given only a single image.
The core of our method is the estimation of a geometric scaffold for the object.
We demonstrate in several experiments the effectiveness of our approach in both synthetic and real images.
arXiv Detail & Related papers (2021-02-17T16:40:28Z)
- Continuous Surface Embeddings [76.86259029442624]
We focus on the task of learning and representing dense correspondences in deformable object categories.
We propose a new, learnable image-based representation of dense correspondences.
We demonstrate that the proposed approach performs on par or better than the state-of-the-art methods for dense pose estimation for humans.
arXiv Detail & Related papers (2020-11-24T22:52:15Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- Neural Object Learning for 6D Pose Estimation Using a Few Cluttered Images [30.240630713652035]
Recent methods for 6D pose estimation of objects assume either textured 3D models or real images that cover the entire range of target poses.
This paper proposes a method, Neural Object Learning (NOL), that creates synthetic images of objects in arbitrary poses by combining only a few observations from cluttered images.
arXiv Detail & Related papers (2020-05-07T19:33:06Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which leads to our model's improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
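For background on the SceneFID metric mentioned above: the standard Fréchet Inception Distance compares Gaussian fits of two sets of deep image features, and an object-centric variant applies the same distance to features of object crops rather than whole images. Below is a minimal sketch of the underlying Fréchet distance computation only; the Inception feature extractor and the object cropping are omitted, and this is not the cited paper's implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two (n, d) feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    # sqrtm can return tiny imaginary parts from numerical error.
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    mean_term = float(((mu_a - mu_b) ** 2).sum())
    return mean_term + float(np.trace(cov_a + cov_b - 2.0 * covmean))
```

In an object-centric setting, `feats_a` and `feats_b` would hold features extracted from matched object crops of the reference and generated images, rather than from the full images.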
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences of its use.