MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
- URL: http://arxiv.org/abs/2104.12763v1
- Date: Mon, 26 Apr 2021 17:55:33 GMT
- Title: MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
- Authors: Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel
Synnaeve, Nicolas Carion
- Abstract summary: We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
- Score: 40.24656027709833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal reasoning systems rely on a pre-trained object detector to
extract regions of interest from the image. However, this crucial module is
typically used as a black box, trained independently of the downstream task and
on a fixed vocabulary of objects and attributes. This makes it challenging for
such systems to capture the long tail of visual concepts expressed in free form
text. In this paper we propose MDETR, an end-to-end modulated detector that
detects objects in an image conditioned on a raw text query, like a caption or
a question. We use a transformer-based architecture to reason jointly over text
and image by fusing the two modalities at an early stage of the model. We
pre-train the network on 1.3M text-image pairs, mined from pre-existing
multi-modal datasets having explicit alignment between phrases in text and
objects in the image. We then fine-tune on several downstream tasks such as
phrase grounding, referring expression comprehension and segmentation,
achieving state-of-the-art results on popular benchmarks. We also investigate
the utility of our model as an object detector on a given label set when
fine-tuned in a few-shot setting. We show that our pre-training approach
provides a way to handle the long tail of object categories which have very few
labelled instances. Our approach can be easily extended for visual question
answering, achieving competitive performance on GQA and CLEVR. The code and
models are available at https://github.com/ashkamath/mdetr.
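The abstract describes the core recipe: a convolutional backbone produces image features, a pretrained language model encodes the text query, the two token sequences are concatenated and fused early in a joint transformer encoder, and a DETR-style decoder with learned object queries predicts boxes aligned to the text. The PyTorch sketch below is only a minimal illustration of that early-fusion pattern, not the released implementation; the ResNet-50 trunk, the toy embedding standing in for the paper's pretrained text encoder, the head names, and all dimensions are assumptions, and positional encodings and the training losses are omitted.

```python
import torch
import torch.nn as nn
import torchvision


class ModulatedDetector(nn.Module):
    def __init__(self, d_model=256, num_queries=100, vocab_size=30522):
        super().__init__()
        # Image backbone: a ResNet-50 conv trunk (illustrative choice).
        # Positional encodings are omitted for brevity.
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # Text encoder: a toy embedding table stands in for the paper's
        # pretrained transformer language model.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Joint transformer: the encoder sees the concatenated image+text
        # sequence (early fusion); the decoder attends from object queries.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.query_embed = nn.Embedding(num_queries, d_model)  # DETR-style object queries
        self.bbox_head = nn.Linear(d_model, 4)                 # (cx, cy, w, h), normalized
        self.align_head = nn.Linear(d_model, d_model)          # box-to-token alignment scores

    def forward(self, images, token_ids):
        batch = images.size(0)
        # Flatten the CNN feature map into a sequence of image tokens.
        feat = self.input_proj(self.backbone(images))      # (B, d, H, W)
        img_tokens = feat.flatten(2).transpose(1, 2)        # (B, H*W, d)
        txt_tokens = self.text_embed(token_ids)             # (B, L, d)
        # Early fusion: one sequence over both modalities.
        src = torch.cat([img_tokens, txt_tokens], dim=1)    # (B, H*W + L, d)
        queries = self.query_embed.weight.unsqueeze(0).expand(batch, -1, -1)
        hs = self.transformer(src, queries)                 # (B, num_queries, d)
        boxes = self.bbox_head(hs).sigmoid()
        # Similarity between each predicted box and each text token, a stand-in
        # for the paper's soft token prediction / contrastive alignment objectives.
        align = torch.einsum('bqd,bld->bql', self.align_head(hs), txt_tokens)
        return boxes, align


# Usage sketch with dummy inputs: one image and a 12-token query.
model = ModulatedDetector()
images = torch.randn(1, 3, 224, 224)
token_ids = torch.randint(0, 30522, (1, 12))
boxes, align = model(images, token_ids)
print(boxes.shape, align.shape)  # torch.Size([1, 100, 4]) torch.Size([1, 100, 12])
```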
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embeddings are used to query the visual features and localize the corresponding target (a generic sketch of this pattern appears after the related-papers list).
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding [34.078590816368056]
We study the problem of visual grounding by considering both phrase extraction and grounding (PEG).
PEG requires a model to extract phrases from text and locate objects from images simultaneously.
We propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text.
arXiv Detail & Related papers (2022-11-28T16:30:46Z)
- Prompt-Based Multi-Modal Image Segmentation [81.58378196535003]
We propose a system that can generate image segmentations based on arbitrary prompts at test time.
A prompt can be either a text or an image.
We build upon the CLIP model as a backbone which we extend with a transformer-based decoder.
arXiv Detail & Related papers (2021-12-18T21:27:19Z)
- Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language.
We first examine the state of the art by comparing modern approaches to the problem.
We discuss modular architectures and graph-based models that interface with structured graph representation.
arXiv Detail & Related papers (2020-07-19T01:45:02Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
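Several entries above share a common decoding pattern: text embeddings cross-attend to visual features in order to localize or segment the referred target (for example, the text-to-image decoder in the Beyond One-to-One entry, or the prompt-conditioned decoder built on CLIP). The sketch below illustrates only that generic pattern; the module name, layer sizes, and the pooling into a single localization map are illustrative assumptions rather than any of the listed papers' actual code.

```python
import torch
import torch.nn as nn


class TextToImageDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(d_model, d_model)  # projects word queries before scoring

    def forward(self, text_emb, visual_feat):
        # text_emb:    (B, L, d)  word embeddings acting as queries
        # visual_feat: (B, P, d)  flattened image features (P = H * W) as keys/values
        hs = self.decoder(tgt=text_emb, memory=visual_feat)   # (B, L, d)
        # Dot products between projected word queries and image positions give a
        # per-position response map; averaging over words yields a coarse mask.
        response = torch.einsum('bld,bpd->blp', self.proj(hs), visual_feat)
        return response.mean(dim=1)                           # (B, P) localization logits


# Usage sketch with dummy tensors.
decoder = TextToImageDecoder()
text_emb = torch.randn(2, 10, 256)          # 10 query words
visual_feat = torch.randn(2, 14 * 14, 256)  # 14x14 feature map, flattened
mask_logits = decoder(text_emb, visual_feat)
print(mask_logits.shape)  # torch.Size([2, 196])
```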