CapDet: Unifying Dense Captioning and Open-World Detection Pretraining
- URL: http://arxiv.org/abs/2303.02489v1
- Date: Sat, 4 Mar 2023 19:53:00 GMT
- Title: CapDet: Unifying Dense Captioning and Open-World Detection Pretraining
- Authors: Yanxin Long, Youpeng Wen, Jianhua Han, Hang Xu, Pengzhen Ren, Wei
Zhang, Shen Zhao, Xiaodan Liang
- Abstract summary: We propose a novel open-world detector named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes.
Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head.
- Score: 68.8382821890089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benefiting from large-scale vision-language pre-training on image-text pairs,
open-world detection methods have shown superior generalization ability under
the zero-shot or few-shot detection settings. However, a pre-defined category
space is still required during the inference stage of existing methods and only
the objects belonging to that space will be predicted. To introduce a "real"
open-world detector, in this paper, we propose a novel method named CapDet to
either predict under a given category list or directly generate the category of
predicted bounding boxes. Specifically, we unify the open-world detection and
dense caption tasks into a single yet effective framework by introducing an
additional dense captioning head to generate the region-grounded captions.
Besides, adding the captioning task will in turn benefit the generalization of
detection performance since the captioning dataset covers more concepts.
Experimental results show that by unifying the dense caption task, our CapDet has
obtained significant performance improvements (e.g., +2.1% mAP on LVIS rare
classes) over the baseline method on LVIS (1203 classes). Besides, our CapDet
also achieves state-of-the-art performance on dense captioning tasks, e.g.,
15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.
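The following is a minimal, illustrative PyTorch sketch of the dual-head idea described in the abstract, not the authors' implementation: a region-text alignment branch scores proposals against an optional category list, while a dense captioning branch decodes a region-grounded caption for each proposal. All module names, dimensions, and component choices are assumptions made for illustration.

    # Hedged sketch of a detector with both an alignment-based classification
    # head and a dense captioning head. Shapes and modules are illustrative only.
    import torch
    import torch.nn as nn

    class DualHeadDetector(nn.Module):
        def __init__(self, feat_dim=256, text_dim=512, vocab_size=30522):
            super().__init__()
            self.region_proj = nn.Linear(feat_dim, text_dim)  # regions -> shared space
            self.box_head = nn.Linear(feat_dim, 4)            # toy box regression
            layer = nn.TransformerDecoderLayer(d_model=text_dim, nhead=8, batch_first=True)
            self.caption_decoder = nn.TransformerDecoder(layer, num_layers=2)
            self.token_embed = nn.Embedding(vocab_size, text_dim)
            self.lm_head = nn.Linear(text_dim, vocab_size)

        def forward(self, region_feats, text_embeds=None, caption_tokens=None):
            # region_feats:   (B, R, feat_dim) pooled proposal features
            # text_embeds:    (C, text_dim) embeddings of category names, if a list is given
            # caption_tokens: (B, R, T) token ids of region-grounded captions (teacher forcing)
            regions = self.region_proj(region_feats)              # (B, R, text_dim)
            out = {"boxes": self.box_head(region_feats)}          # (B, R, 4)
            if text_embeds is not None:
                # detection branch: classify each region against the given category list
                out["cls_logits"] = regions @ text_embeds.t()     # (B, R, C)
            if caption_tokens is not None:
                # captioning branch: each caption attends only to its own region
                B, R, T = caption_tokens.shape
                tgt = self.token_embed(caption_tokens).flatten(0, 1)   # (B*R, T, text_dim)
                mem = regions.flatten(0, 1).unsqueeze(1)               # (B*R, 1, text_dim)
                causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
                hid = self.caption_decoder(tgt, memory=mem, tgt_mask=causal)
                out["caption_logits"] = self.lm_head(hid).view(B, R, T, -1)
            return out

    # Toy usage: 2 images, 5 proposals each, a 3-name category list, 7-token captions.
    model = DualHeadDetector()
    out = model(torch.randn(2, 5, 256),
                text_embeds=torch.randn(3, 512),
                caption_tokens=torch.randint(0, 30522, (2, 5, 7)))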
Related papers
- Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training [44.008094698200026]
PAC-S++ is a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data.
We show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors.
arXiv Detail & Related papers (2024-10-09T18:00:09Z)
- TROPE: TRaining-Free Object-Part Enhancement for Seamlessly Improving Fine-Grained Zero-Shot Image Captioning [30.506968671472517]
We introduce TRaining-Free Object-Part Enhancement (TROPE)
TROPE enriches a base caption with additional object-part details using object detector proposals and Natural Language Processing techniques.
Our evaluations show that TROPE consistently boosts performance across all tested zero-shot IC approaches and achieves state-of-the-art results on fine-grained IC datasets.
arXiv Detail & Related papers (2024-09-30T05:24:01Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [78.0010542552784]
CoDet is a novel approach to learn object-level vision-language representations for open-vocabulary object detection.
By grouping images whose captions mention a shared concept, objects corresponding to that concept are expected to exhibit high co-occurrence across the group.
CoDet has superior performances and compelling scalability in open-vocabulary detection.
arXiv Detail & Related papers (2023-10-25T14:31:02Z)
- EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment [28.983503845298824]
We propose Early Dense Alignment (EDA) to bridge the gap between generalizable local semantics and object-level prediction.
In EDA, we use object-level supervision to learn the dense-level rather than object-level alignment to maintain the local fine-grained semantics.
arXiv Detail & Related papers (2023-09-03T12:04:14Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input (a minimal illustrative sketch of this projection appears at the end of this page).
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and nouns in captions.
We devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that focuses only on matching objects to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Zero-Shot Detection via Vision and Language Knowledge Distillation [28.54818724798105]
We propose ViLD, a training method via Vision and Language knowledge Distillation.
We distill the knowledge from a pre-trained zero-shot image classification model into a two-stage detector.
Our method aligns the region embeddings in the detector to the text and image embeddings inferred by the pre-trained model (a minimal illustrative sketch of these alignment signals appears at the end of this page).
arXiv Detail & Related papers (2021-04-28T17:58:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.
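As a companion to the ViLD entry above ("Zero-Shot Detection via Vision and Language Knowledge Distillation"), here is a minimal sketch, under assumed shapes and loss choices, of the two alignment signals that entry describes: classifying detector region embeddings against frozen text embeddings of category names, and distilling frozen image embeddings of cropped proposals into those same region embeddings. It is not the paper's code; the temperature, the L1 distillation term, and the omission of a learned background embedding are simplifications.

    import torch
    import torch.nn.functional as F

    def vild_style_losses(region_embeds, text_embeds, labels, clip_crop_embeds, tau=0.01):
        # region_embeds:    (N, D) embeddings produced by the detector for N proposals
        # text_embeds:      (C, D) frozen text embeddings of the C category names
        # labels:           (N,)   ground-truth class index for each proposal
        # clip_crop_embeds: (N, D) frozen image embeddings of the cropped proposals
        r = F.normalize(region_embeds, dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        logits = r @ t.t() / tau                                    # (N, C) region-to-text similarity
        text_loss = F.cross_entropy(logits, labels)                 # align regions to category text
        distill_loss = F.l1_loss(region_embeds, clip_crop_embeds)   # align regions to image embeddings
        return text_loss, distill_loss

    # Toy usage with random tensors: 8 proposals, 20 categories, 512-d embeddings.
    tl, dl = vild_style_losses(torch.randn(8, 512), torch.randn(20, 512),
                               torch.randint(0, 20, (8,)), torch.randn(8, 512))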
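Similarly, for the DeCap entry above, here is one minimal way to realize "projecting the visual embedding into the CLIP text embedding space": a similarity-weighted combination over a memory of caption text embeddings, whose output can then be fed to a decoder trained on text only. The memory bank, temperature, and normalization are illustrative assumptions rather than the paper's exact recipe.

    import torch
    import torch.nn.functional as F

    def project_to_text_space(image_embed, text_memory, tau=0.01):
        # image_embed: (D,)    CLIP image embedding of the input picture
        # text_memory: (M, D)  CLIP text embeddings of a corpus of captions
        sims = F.normalize(text_memory, dim=-1) @ F.normalize(image_embed, dim=0)  # (M,)
        weights = torch.softmax(sims / tau, dim=0)   # soft retrieval over the memory
        return weights @ text_memory                 # (D,) vector living in the text space

    # Toy usage: a 512-d image embedding against a memory of 1,000 caption embeddings.
    proj = project_to_text_space(torch.randn(512), torch.randn(1000, 512))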