Detect Only What You Specify : Object Detection with Linguistic Target
- URL: http://arxiv.org/abs/2211.11572v1
- Date: Fri, 18 Nov 2022 07:28:47 GMT
- Title: Detect Only What You Specify : Object Detection with Linguistic Target
- Authors: Moyuru Yamada
- Abstract summary: We propose the Language-Targeted Detector (LTD) for targeted detection, based on a recently proposed Transformer-based detector.
LTD is an encoder-decoder architecture, and our conditional decoder allows the model to reason about the encoded image with the textual input as the linguistic context.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object detection is a computer vision task of predicting a set of bounding
boxes and category labels for each object of interest in a given image. The
category is related to a linguistic symbol such as 'dog' or 'person' and there
should be relationships among them. However, the object detector only learns to
classify the categories and does not treat them as linguistic symbols.
Multi-modal models often use the pre-trained object detector to extract object
features from the image, but the models are separated from the detector and the
extracted visual features do not change with their linguistic input. We
rethink object detection as a vision-and-language reasoning task. We then
propose a targeted detection task, where detection targets are given in natural
language and the goal is to detect all of the target objects, and only those, in
a given image. Nothing is detected if no target is given. Commonly used
modern object detectors have many hand-designed components, such as anchors, and
it is difficult to fuse textual inputs into such a complex pipeline. We thus
propose the Language-Targeted Detector (LTD) for targeted detection, based on a
recently proposed Transformer-based detector. LTD is an encoder-decoder
architecture, and our conditional decoder allows the model to reason about the
encoded image with the textual input as the linguistic context. We evaluate
the detection performance of LTD on the COCO object detection dataset and also
show that our model improves the detection results with the textual input
grounded to the visual objects.
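The input/output contract of the targeted detection task described in the abstract can be illustrated with a minimal sketch. This is not LTD itself: the abstract says LTD conditions its decoder on the text, whereas the sketch below only post-filters the output of a conventional detector to show what "detect only the targets, and nothing when no target is given" means. The `Detection` class, the `targeted_detections` function, the score threshold, and the sample boxes are all hypothetical names and values introduced for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    """One detector output (hypothetical structure, for illustration only)."""
    label: str    # category name, e.g. "dog"
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    score: float  # detector confidence in [0, 1]


def targeted_detections(detections, target_text, threshold=0.5):
    """Keep only detections whose label is named in target_text.

    An empty target text yields no detections, matching the task definition.
    Labels are matched as whole words or phrases, ignoring case.
    """
    if not target_text.strip():
        return []
    kept = []
    for det in detections:
        pattern = r"\b" + re.escape(det.label) + r"\b"
        named = re.search(pattern, target_text, flags=re.IGNORECASE)
        if named and det.score >= threshold:
            kept.append(det)
    return kept


# Made-up detections standing in for the output of a conventional detector.
detections = [
    Detection("dog", (10, 20, 110, 220), 0.92),
    Detection("person", (150, 30, 260, 400), 0.88),
    Detection("traffic light", (300, 5, 330, 80), 0.75),
]

print([d.label for d in targeted_detections(detections, "a dog and a traffic light")])
# prints ['dog', 'traffic light']
print(targeted_detections(detections, ""))
# prints []
```

Post-filtering of this kind leaves the visual features independent of the text, which is the limitation the abstract attributes to pipelines that keep the detector separate from the language input.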
Related papers
- Generative Region-Language Pretraining for Open-Ended Object Detection [55.42484781608621]
We propose a framework named GenerateU, which can detect dense objects and generate their names in a free-form way.
Our framework achieves comparable results to the open-vocabulary object detection method GLIP.
arXiv Detail & Related papers (2024-03-15T10:52:39Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Bridging the Gap Between Object Detection and User Intent via Query-Modulation [33.967176965675264]
Query-modulated detectors show superior performance at detecting objects for a given label of interest.
They can be simultaneously trained to solve for both query-modulated detection and standard object detection.
arXiv Detail & Related papers (2021-06-18T17:47:53Z)
- Self-supervised object detection from audio-visual correspondence [101.46794879729453]
We tackle the problem of learning object detectors without supervision.
We do not assume image-level class labels; instead, we extract a supervisory signal from audio-visual data.
We show that our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.
arXiv Detail & Related papers (2021-04-13T17:59:03Z)
- Class-agnostic Object Detection [16.97782147401037]
We propose class-agnostic object detection as a new problem that focuses on detecting objects irrespective of their object-classes.
Specifically, the goal is to predict bounding boxes for all objects in an image but not their object-classes.
We propose training and evaluation protocols for benchmarking class-agnostic detectors to advance future research in this domain.
arXiv Detail & Related papers (2020-11-28T19:22:38Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- Black-box Explanation of Object Detectors via Saliency Maps [66.745167677293]
We propose D-RISE, a method for generating visual explanations for the predictions of object detectors.
We show that D-RISE can be easily applied to different object detectors including one-stage detectors such as YOLOv3 and two-stage detectors such as Faster-RCNN.
arXiv Detail & Related papers (2020-06-05T02:13:35Z)
- Detective: An Attentive Recurrent Model for Sparse Object Detection [25.5804429439316]
Detective is an attentive object detector that identifies objects in images in a sequential manner.
Detective is a sparse object detector that generates a single bounding box per object instance.
We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks.
arXiv Detail & Related papers (2020-04-25T17:41:52Z)