Related papers: Pix2seq: A Language Modeling Framework for Object Detection

URL: http://arxiv.org/abs/2109.10852v1
Date: Wed, 22 Sep 2021 17:26:36 GMT
Title: Pix2seq: A Language Modeling Framework for Object Detection
Authors: Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton
Abstract summary: Pix2Seq is a simple and generic framework for object detection. We train a neural net to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out.
Score: 12.788663431798588
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we simply cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural net to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
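The abstract's central idea, expressing bounding boxes and class labels as one sequence of discrete tokens, can be illustrated with a small sketch. The details below are assumptions made for illustration and are not stated in the abstract: coordinates are normalized to [0, 1] and quantized into `NUM_BINS` bins, each object is written as five tokens (four coordinates, then a class token), class tokens are offset past the coordinate bins, and a single end-of-sequence token closes the sequence. The names `quantize`, `objects_to_tokens`, and `tokens_to_objects` are hypothetical helpers, not the authors' API.

```python
from typing import List, Tuple

# Assumed vocabulary layout (illustrative only, not taken from the abstract):
#   [0, NUM_BINS)                       -> quantized coordinate tokens
#   [NUM_BINS, NUM_BINS + NUM_CLASSES)  -> class label tokens
#   NUM_BINS + NUM_CLASSES              -> end-of-sequence token
NUM_BINS = 1000
NUM_CLASSES = 80
EOS_TOKEN = NUM_BINS + NUM_CLASSES

Box = Tuple[float, float, float, float]  # (ymin, xmin, ymax, xmax), normalized to [0, 1]
DetectedObject = Tuple[Box, int]         # (box, class id)


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    clipped = min(max(value, 0.0), 1.0)
    return int(round(clipped * (num_bins - 1)))


def dequantize(token: int, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to an approximate normalized coordinate."""
    return token / (num_bins - 1)


def objects_to_tokens(objects: List[DetectedObject]) -> List[int]:
    """Flatten objects into one token sequence: 5 tokens per object, then EOS."""
    tokens: List[int] = []
    for box, class_id in objects:
        tokens.extend(quantize(coord) for coord in box)
        tokens.append(NUM_BINS + class_id)
    tokens.append(EOS_TOKEN)
    return tokens


def tokens_to_objects(tokens: List[int]) -> List[DetectedObject]:
    """Read objects back out of a token sequence, stopping at EOS."""
    objects: List[DetectedObject] = []
    i = 0
    while i + 5 <= len(tokens) and tokens[i] != EOS_TOKEN:
        ymin, xmin, ymax, xmax = (dequantize(t) for t in tokens[i:i + 4])
        class_id = tokens[i + 4] - NUM_BINS
        objects.append(((ymin, xmin, ymax, xmax), class_id))
        i += 5
    return objects
```

Under these assumptions, a detector trained as a language model would emit such a sequence token by token, conditioned on the image, and `tokens_to_objects` would recover boxes up to the quantization error of one bin.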

Related papers

From classical techniques to convolution-based models: A review of object detection algorithms [0.562479170374811]
Object detection is a fundamental task in computer vision and image understanding. Traditional methods, which relied on handcrafted features and shallow models, struggled with complex visual data and showed limited performance. Deep learning, especially Convolutional Neural Networks (CNNs), addressed these limitations by automatically learning rich, hierarchical features directly from data.
arXiv Detail & Related papers (2024-12-06T18:32:54Z)
Language-Conditioned Observation Models for Visual Object Search [12.498575839909334]
We bridge the gap in realistic object search by posing the problem as a partially observable Markov decision process (POMDP). We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise. We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.
arXiv Detail & Related papers (2023-09-13T19:30:53Z)
CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection [42.2847114428716]
Task driven object detection aims to detect object instances suitable for affording a task in an image. Its challenge lies in object categories available for the task being too diverse to be limited to a closed set of object vocabulary for traditional object detection. We propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task.
arXiv Detail & Related papers (2023-09-03T06:18:39Z)
Uncertainty Aware Active Learning for Reconfiguration of Pre-trained Deep Object-Detection Networks for New Target Domains [0.0]
Object detection is one of the most important and fundamental aspects of computer vision tasks. To obtain training data for object detection models efficiently, many datasets opt to obtain their unannotated data in video format. Annotating every frame from a video is costly and inefficient since many frames contain very similar information for the model to learn from. In this paper, we propose a novel active learning algorithm for object detection models to tackle this problem.
arXiv Detail & Related papers (2023-03-22T17:14:10Z)
Object Detection in Aerial Images with Uncertainty-Aware Graph Network [61.02591506040606]
We propose a novel uncertainty-aware object detection framework with a structured-graph, where nodes and edges are denoted by objects. We refer to our model as Uncertainty-Aware Graph network for object DETection (UAGDet).
arXiv Detail & Related papers (2022-08-23T07:29:03Z)
Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images. We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image. We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z)
Aligning Pretraining for Detection via Object-Level Contrastive Learning [57.845286545603415]
Image-level contrastive representation learning has proven to be highly effective as a generic model for transfer learning. We argue that this could be sub-optimal and thus advocate a design principle which encourages alignment between the self-supervised pretext task and the downstream task. Our method, called Selective Object COntrastive learning (SoCo), achieves state-of-the-art results for transfer performance on COCO detection.
arXiv Detail & Related papers (2021-06-04T17:59:52Z)
Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain. We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language. We first examine the state of the art by comparing modern approaches to the problem. We discuss modular architectures and graph-based models that interface with structured graph representation.
arXiv Detail & Related papers (2020-07-19T01:45:02Z)
Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs. Building upon entity-level masked language models, our first contribution is an entity masking scheme. In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
Detective: An Attentive Recurrent Model for Sparse Object Detection [25.5804429439316]
Detective is an attentive object detector that identifies objects in images in a sequential manner. Detective is a sparse object detector that generates a single bounding box per object instance. We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks.
arXiv Detail & Related papers (2020-04-25T17:41:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.