DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and
Grounding
- URL: http://arxiv.org/abs/2211.15516v2
- Date: Wed, 30 Nov 2022 17:49:14 GMT
- Title: DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and
Grounding
- Authors: Shilong Liu, Yaoyuan Liang, Feng Li, Shijia Huang, Hao Zhang, Hang Su,
Jun Zhu, Lei Zhang
- Abstract summary: We study the problem of visual grounding by considering both phrase extraction and grounding (PEG).
PEG requires a model to extract phrases from text and locate objects from images simultaneously.
We propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text.
- Score: 34.078590816368056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of visual grounding by considering both
phrase extraction and grounding (PEG). In contrast to the previous
phrase-known-at-test setting, PEG requires a model to extract phrases from text
and locate objects from images simultaneously, which is a more practical
setting in real applications. As phrase extraction can be regarded as a $1$D
text segmentation problem, we formulate PEG as a dual detection problem and
propose a novel DQ-DETR model, which introduces dual queries to probe different
features from image and text for object prediction and phrase mask prediction.
Each pair of dual queries is designed to have shared positional parts but
different content parts. Such a design effectively alleviates the difficulty of
modality alignment between image and text (in contrast to a single query
design) and empowers Transformer decoder to leverage phrase mask-guided
attention to improve performance. To evaluate the performance of PEG, we also
propose a new metric CMAP (cross-modal average precision), analogous to the AP
metric in object detection. The new metric overcomes the ambiguity of Recall@1
in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG
pre-trained DQ-DETR establishes new state-of-the-art results on all visual
grounding benchmarks with a ResNet-101 backbone. For example, it achieves
$91.04\%$ and $83.51\%$ in terms of recall rate on RefCOCO testA and testB with
a ResNet-101 backbone. Code will be available at
\url{https://github.com/IDEA-Research/DQ-DETR}.
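To make the dual-query design concrete, here is a minimal PyTorch sketch (assumed names and dimensions, not the authors' implementation): each of the N query pairs shares one positional embedding while keeping separate content embeddings for box prediction and phrase-mask prediction.

```python
import torch
import torch.nn as nn

class DualQueries(nn.Module):
    """Sketch of DQ-DETR-style dual queries: each pair shares a positional
    part but has a different content part (hypothetical module and names)."""

    def __init__(self, num_queries: int = 100, dim: int = 256):
        super().__init__()
        # One positional embedding shared by both queries in a pair.
        self.pos = nn.Embedding(num_queries, dim)
        # Separate content embeddings for the two prediction tasks.
        self.obj_content = nn.Embedding(num_queries, dim)     # object boxes
        self.phrase_content = nn.Embedding(num_queries, dim)  # phrase masks

    def forward(self, batch_size: int):
        expand = lambda w: w.unsqueeze(0).expand(batch_size, -1, -1)
        pos = expand(self.pos.weight)
        obj_q = pos + expand(self.obj_content.weight)     # probes image features
        phr_q = pos + expand(self.phrase_content.weight)  # probes text features
        return obj_q, phr_q

obj_q, phr_q = DualQueries()(batch_size=2)
print(obj_q.shape, phr_q.shape)  # torch.Size([2, 100, 256]) each
```

Sharing only the positional part keeps each box query tied to its phrase-mask counterpart while letting the two content parts specialize in image and text features, which is what eases the modality alignment described above.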
Related papers
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
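As a toy illustration of this evaluation setting: both paraphrases are embedded independently and ranked against the same image embeddings, and a paraphrase-robust dual encoder should produce similar rankings. The sketch below uses random tensors as stand-ins for real CLIP-style embeddings; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def rank_images(text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """Return image indices sorted by cosine similarity to the text query."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    return (image_embs @ text_emb).argsort(descending=True)

torch.manual_seed(0)
images = torch.randn(1000, 512)              # stand-in image embeddings
q1, q2 = torch.randn(512), torch.randn(512)  # stand-ins for two paraphrases
r1, r2 = rank_images(q1, images), rank_images(q2, images)
# A paraphrase-robust model would give a large top-k overlap here.
print(len(set(r1[:10].tolist()) & set(r2[:10].tolist())))
```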
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z)
- Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the combinatorial entity pair distribution.
We employ a DETR-based encoder-decoder design with conditional queries to significantly reduce the entity label space as well.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z)
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bezier curve control points to localize texts, are quite popular in scene text detection.
However, the point label form used implies a human reading order, which affects the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
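A minimal sketch of this dynamic-point idea, assuming normalized (x, y) coordinates are re-embedded as queries at every decoder layer and refined by a predicted offset; module names and dimensions are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PointQueryRefiner(nn.Module):
    """Point coordinates act as queries and are updated between layers."""

    def __init__(self, dim: int = 256, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.coord_embed = nn.Linear(2, dim)   # embed (x, y) as a query
        self.offset_head = nn.Linear(dim, 2)   # per-layer coordinate update

    def forward(self, points: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # points: (B, num_points, 2) in [0, 1]; memory: flattened image features
        for layer in self.layers:
            q = layer(self.coord_embed(points), memory)
            points = (points + self.offset_head(q)).clamp(0.0, 1.0)
        return points

pts = torch.rand(2, 16, 2)        # 16 boundary points per text instance
mem = torch.randn(2, 100, 256)    # (B, H*W, dim) encoder output
print(PointQueryRefiner()(pts, mem).shape)  # torch.Size([2, 16, 2])
```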
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
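The core trick in this line of work can be sketched as attention applied along the batch axis, so samples in a mini-batch attend to one another; the snippet below is a hypothetical minimal version, not the released BatchFormerV2 code, which additionally handles dense (per-pixel) features and a two-stream training scheme.

```python
import torch
import torch.nn as nn

class BatchAttention(nn.Module):
    """Sketch of a BatchFormer-style module: treat the batch axis as the
    sequence axis so per-sample features can borrow from other samples.
    Typically enabled only during training."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=8)  # seq-first layout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) -> view as (seq=batch, batch=1, dim) for attention
        return self.layer(x.unsqueeze(1)).squeeze(1)

feats = torch.randn(32, 256)          # one feature vector per sample
print(BatchAttention()(feats).shape)  # torch.Size([32, 256])
```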
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- Learning Quality-aware Representation for Multi-person Pose Regression [8.83185608408674]
We learn a quality-aware representation for pose regression.
Our method achieves the state-of-the-art result of 71.7 AP on MS COCO test-dev set.
arXiv Detail & Related papers (2022-01-04T11:10:28Z)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
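Early fusion of the two modalities can be sketched as projecting image and text features to a shared width, concatenating them into one sequence, and encoding them jointly; the dimensions below are illustrative stand-ins (a CNN feature map and a language-model hidden size), not MDETR's exact configuration.

```python
import torch
import torch.nn as nn

dim = 256
img_proj = nn.Linear(2048, dim)   # e.g. flattened CNN feature channels
txt_proj = nn.Linear(768, dim)    # e.g. text-encoder hidden size
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=6
)

img_feats = torch.randn(2, 100, 2048)  # 100 spatial positions per image
txt_feats = torch.randn(2, 20, 768)    # 20 text tokens per query
fused = torch.cat([img_proj(img_feats), txt_proj(txt_feats)], dim=1)
memory = encoder(fused)                # joint image-text memory for a decoder
print(memory.shape)                    # torch.Size([2, 120, 256])
```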
arXiv Detail & Related papers (2021-04-26T17:55:33Z)
- Detector-Free Weakly Supervised Grounding by Separation [76.65699170882036]
Weakly Supervised phrase-Grounding (WSG) deals with the task of using image-sentence pairs, without box annotations, to learn to localize arbitrary text phrases in images.
We propose Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector.
We demonstrate a significant accuracy improvement of up to $8.5\%$ over the previous DF-WSG SotA.
arXiv Detail & Related papers (2021-04-20T08:27:31Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
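The set-prediction loss rests on a one-to-one bipartite matching between the fixed set of query predictions and the ground-truth objects, solved with the Hungarian algorithm; the sketch below uses a random cost matrix as a stand-in for DETR's real matching cost (class probability plus L1 and generalized-IoU box terms).

```python
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 100, 4
cost = torch.rand(num_queries, num_gt)  # stand-in pairwise matching costs
# Hungarian matching: each ground-truth object gets exactly one query;
# unmatched queries are trained toward the "no object" class.
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))
```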