Pix2seq: A Language Modeling Framework for Object Detection
- URL: http://arxiv.org/abs/2109.10852v1
- Date: Wed, 22 Sep 2021 17:26:36 GMT
- Title: Pix2seq: A Language Modeling Framework for Object Detection
- Authors: Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton
- Abstract summary: Pix2Seq is a simple and generic framework for object detection.
We train a neural net to perceive the image and generate the desired sequence.
Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out.
- Score: 12.788663431798588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Pix2Seq, a simple and generic framework for object
detection. Unlike existing approaches that explicitly integrate prior knowledge
about the task, we simply cast object detection as a language modeling task
conditioned on the observed pixel inputs. Object descriptions (e.g., bounding
boxes and class labels) are expressed as sequences of discrete tokens, and we
train a neural net to perceive the image and generate the desired sequence. Our
approach is based mainly on the intuition that if a neural net knows about
where and what the objects are, we just need to teach it how to read them out.
Beyond the use of task-specific data augmentations, our approach makes minimal
assumptions about the task, yet it achieves competitive results on the
challenging COCO dataset, compared to highly specialized and well optimized
detection algorithms.
Related papers
- Language-Conditioned Observation Models for Visual Object Search [12.498575839909334]
We bridge the gap in realistic object search by posing the problem as a partially observable Markov decision process (POMDP)
We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise.
We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.
arXiv Detail & Related papers (2023-09-13T19:30:53Z) - CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection [42.2847114428716]
Task driven object detection aims to detect object instances suitable for affording a task in an image.
Its challenge lies in object categories available for the task being too diverse to be limited to a closed set of object vocabulary for traditional object detection.
We propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task.
arXiv Detail & Related papers (2023-09-03T06:18:39Z) - Uncertainty Aware Active Learning for Reconfiguration of Pre-trained
Deep Object-Detection Networks for New Target Domains [0.0]
Object detection is one of the most important and fundamental aspects of computer vision tasks.
To obtain training data for object detection model efficiently, many datasets opt to obtain their unannotated data in video format.
Annotating every frame from a video is costly and inefficient since many frames contain very similar information for the model to learn from.
In this paper, we proposed a novel active learning algorithm for object detection models to tackle this problem.
arXiv Detail & Related papers (2023-03-22T17:14:10Z) - Object Detection in Aerial Images with Uncertainty-Aware Graph Network [61.02591506040606]
We propose a novel uncertainty-aware object detection framework with a structured-graph, where nodes and edges are denoted by objects.
We refer to our model as Uncertainty-Aware Graph network for object DETection (UAGDet)
arXiv Detail & Related papers (2022-08-23T07:29:03Z) - Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images.
We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image.
We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z) - Aligning Pretraining for Detection via Object-Level Contrastive Learning [57.845286545603415]
Image-level contrastive representation learning has proven to be highly effective as a generic model for transfer learning.
We argue that this could be sub-optimal and thus advocate a design principle which encourages alignment between the self-supervised pretext task and the downstream task.
Our method, called Selective Object COntrastive learning (SoCo), achieves state-of-the-art results for transfer performance on COCO detection.
arXiv Detail & Related papers (2021-06-04T17:59:52Z) - Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z) - Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z) - Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language.
We first examine the state of the art by comparing modern approaches to the problem.
We discuss modular architectures and graph-based models that interface with structured graph representation.
arXiv Detail & Related papers (2020-07-19T01:45:02Z) - Exploiting Structured Knowledge in Text via Graph-Guided Representation
Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z) - Detective: An Attentive Recurrent Model for Sparse Object Detection [25.5804429439316]
Detective is an attentive object detector that identifies objects in images in a sequential manner.
Detective is a sparse object detector that generates a single bounding box per object instance.
We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks.
arXiv Detail & Related papers (2020-04-25T17:41:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.