DesCo: Learning Object Recognition with Rich Language Descriptions
- URL: http://arxiv.org/abs/2306.14060v1
- Date: Sat, 24 Jun 2023 21:05:02 GMT
- Title: DesCo: Learning Object Recognition with Rich Language Descriptions
- Authors: Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, Kai-Wei Chang
- Abstract summary: Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
- Score: 93.8177229428617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in vision-language approaches have instigated a
paradigm shift in learning visual recognition models from language supervision.
These approaches align objects with language queries (e.g. "a photo of a cat")
and improve the models' ability to identify novel objects and domains.
Recently, several studies have attempted to query these models with complex
language expressions that include specifications of fine-grained semantic
details, such as attributes, shapes, textures, and relations. However, simply
incorporating language descriptions as queries does not guarantee accurate
interpretation by the models. In fact, our experiments show that GLIP, the
state-of-the-art vision-language model for object detection, often disregards
contextual information in the language descriptions and instead relies heavily
on detecting objects solely by their names. To tackle these challenges, we
propose a new description-conditioned (DesCo) paradigm for learning object
recognition models with rich language descriptions, built on two major
innovations: 1) we employ a large language model as a commonsense knowledge
engine to generate rich language descriptions of objects based on object names
and the raw image-text captions; 2) we design context-sensitive queries to
improve the model's ability to decipher intricate nuances embedded within
descriptions and to force the model to focus on context rather than object
names alone. On two novel-object detection benchmarks, LVIS and OmniLabel,
under the zero-shot detection setting, our approach achieves 34.8 APr on LVIS
minival (+9.1) and 29.3 AP on OmniLabel (+3.6), surpassing the prior
state-of-the-art models, GLIP and FIBER, by a large margin.
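The abstract names the two DesCo ingredients but not their mechanics. The sketch below illustrates one plausible way the pieces could fit together: a prompt that asks an LLM (acting as a commonsense knowledge engine) to describe an object given the raw caption, and a context-sensitive query that mixes the generated description with confusable distractors so a grounding model cannot rely on the object name alone. The prompt wording, the `llm_generate` hook, and the distractor mixing are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch of a DesCo-style query construction. The prompt template,
# the `llm_generate` callable, and the way distractors are mixed in are
# illustrative assumptions, not the paper's actual recipe.

import random
from typing import Callable, List


def build_description_prompt(object_name: str, caption: str) -> str:
    """Ask an LLM (used as a commonsense knowledge engine) for a rich
    description of `object_name`, grounded in the raw image-text caption."""
    return (
        f'Image caption: "{caption}"\n'
        f'Describe the appearance of a "{object_name}" in this image, '
        "including attributes, shape, texture, and its relation to nearby objects."
    )


def build_context_sensitive_query(
    object_name: str,
    caption: str,
    llm_generate: Callable[[str], str],
    distractor_descriptions: List[str],
) -> str:
    """Compose a detection query in which the description, not just the object
    name, carries the signal: the positive description is shuffled together
    with confusable distractors so the model must read the context."""
    positive = llm_generate(build_description_prompt(object_name, caption))
    candidates = [positive] + distractor_descriptions
    random.shuffle(candidates)
    # GLIP-style grounding models consume one text prompt in which candidate
    # phrases are separated, so each can be aligned or rejected independently.
    return ". ".join(candidates)


if __name__ == "__main__":
    # Stub LLM so the sketch runs without any external API.
    fake_llm = lambda prompt: "a small gray tabby cat with striped fur curled on a red couch"
    query = build_context_sensitive_query(
        "cat",
        "a photo of a cat sleeping on a couch",
        fake_llm,
        ["a brown dog with floppy ears lying on a rug"],
    )
    print(query)
```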
Related papers
- Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags [28.368960723666458] (2024-06-16)
Multimodal Large Language Models (MLLMs) struggle with critical problems when required to provide a precise and detailed response to a visual instruction.
We show effectiveness in mitigating these issues, but at the expensive cost of collecting a vast amount of new data.
We propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information.
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327] (2023-05-29)
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816] (2023-05-24)
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
- Learning to Name Classes for Vision and Language Models [57.0059455405424] (2023-04-04)
Large scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names (a minimal sketch of this idea follows the list below).
- Open-Vocabulary Object Detection using Pseudo Caption Labels [3.260777306556596] (2023-03-23)
We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance.
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497] (2022-11-19)
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645] (2022-10-18)
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
- Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge [60.616313552585645] (2022-02-25)
We present models for effective Ambiguity Detection and Coreference Resolution in Conversational AI.
Specifically, we use TOD-BERT and LXMERT based models, compare them to a number of baselines and provide ablation experiments.
Our results show that (1) language models are able to exploit correlations in the data to detect ambiguity; and (2) unimodal coreference resolution models can avoid the need for a vision component.
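As referenced in the "Learning to Name Classes for Vision and Language Models" entry above, the sketch below shows the general idea of learning one query embedding per class against a frozen encoder. The toy encoder (random features standing in for a frozen CLIP-style vision tower), the dimensions, the temperature, and the training data are all placeholders rather than that paper's actual setup.

```python
# Minimal sketch of learning per-class "name" embeddings while the
# vision-language model stays frozen. Random tensors stand in for the frozen
# encoders; only the class embeddings receive gradients.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_classes, embed_dim, num_images = 5, 64, 200

# Placeholder for a frozen image encoder: in practice these would be features
# from the vision tower of a CLIP-like model, computed with gradients disabled.
frozen_image_features = F.normalize(torch.randn(num_images, embed_dim), dim=-1)
labels = torch.randint(0, num_classes, (num_images,))

# Learnable per-class embeddings; the random init here stands in for the
# text-encoder embeddings of the original class names.
class_embeddings = torch.nn.Parameter(F.normalize(torch.randn(num_classes, embed_dim), dim=-1))

optimizer = torch.optim.Adam([class_embeddings], lr=1e-2)

for step in range(200):
    # Cosine-similarity logits with a fixed temperature, as in CLIP-style scoring.
    logits = frozen_image_features @ F.normalize(class_embeddings, dim=-1).T / 0.07
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.3f}")
```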