Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2308.15846v1
- Date: Wed, 30 Aug 2023 08:33:13 GMT
- Title: Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
- Authors: Yifan Xu, Mengdan Zhang, Xiaoshan Yang, Changsheng Xu
- Abstract summary: We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
Diverse multi-modal masked language modeling is realized by adding an object divergence constraint to traditional multi-modal masked language modeling (MLM).
- Score: 72.36017150922504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore, for the first time, helpful multi-modal contextual
knowledge for understanding novel categories in open-vocabulary object detection (OVD).
Multi-modal contextual knowledge refers to the joint relationships across regions and words.
Incorporating such knowledge into OVD is challenging, however, because previous detection
frameworks fail to model it jointly: object detectors accept only vision inputs, and no
caption description is available at test time. To this end, we propose a multi-modal
contextual knowledge distillation framework, MMC-Det, which transfers the contextual
knowledge learned by a teacher fusion transformer with diverse multi-modal masked language
modeling (D-MLM) to a student detector. D-MLM is realized by imposing an object divergence
constraint on traditional multi-modal masked language modeling (MLM), so as to extract the
fine-grained region-level visual contexts that are vital for object detection. Extensive
experiments on various detection datasets demonstrate the effectiveness of our multi-modal
context learning strategy, with our approach clearly outperforming recent state-of-the-art
methods.
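The abstract above describes two ingredients: an MLM-style distillation term that matches a student's predictions for masked caption tokens to those of a teacher fusion transformer, and an object divergence constraint meant to keep region-level contexts distinct. Purely as an illustrative reading of that objective (not the authors' MMC-Det implementation), the sketch below shows one minimal way such a combined loss could look; every function name, tensor shape, loss form, and weight here is an assumption.

```python
# Hypothetical sketch of an MMC-Det-style training objective.
# Names, shapes, and loss forms are illustrative assumptions, not released code.
import torch
import torch.nn.functional as F


def mlm_distillation_loss(teacher_logits, student_logits, masked_positions):
    """KL-match the student's masked-token predictions to the teacher's.

    teacher_logits, student_logits: (batch, seq_len, vocab)
    masked_positions: boolean mask (batch, seq_len) marking masked caption tokens.
    """
    t = teacher_logits[masked_positions]   # (num_masked, vocab)
    s = student_logits[masked_positions]
    return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                    reduction="batchmean")


def object_divergence_constraint(region_features):
    """One plausible 'object divergence' term: push pooled region embeddings apart.

    region_features: (batch, num_regions, dim)
    """
    z = F.normalize(region_features, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", z, z)             # pairwise cosine similarity
    off_diag = sim - torch.eye(z.shape[1], device=z.device)  # zero out self-similarity
    return off_diag.pow(2).mean()                         # penalize similar regions


# Toy usage with random tensors standing in for real model outputs.
B, L, V, R, D = 2, 16, 1000, 8, 256
teacher_logits = torch.randn(B, L, V)                     # teacher fusion transformer
student_logits = torch.randn(B, L, V, requires_grad=True) # student-side predictions
regions = torch.randn(B, R, D, requires_grad=True)        # pooled region features
mask = torch.zeros(B, L, dtype=torch.bool)
mask[:, ::4] = True                                       # mask every 4th caption token

loss = mlm_distillation_loss(teacher_logits, student_logits, mask) \
       + 0.1 * object_divergence_constraint(regions)
loss.backward()
```

Note that, per the abstract, captions are only available during training; at test time the student detector would run on vision inputs alone, which is why the contextual knowledge must be distilled rather than used directly.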
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge [73.76722241704488]
We propose a plug-in framework named WisdoM that leverages contextual world knowledge induced from large vision-language models (LVLMs) for enhanced multimodal sentiment analysis.
We show that our approach has substantial improvements over several state-of-the-art methods.
arXiv Detail & Related papers (2024-01-12T16:08:07Z)
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding [93.92313947913831]
We introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities.
To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
arXiv Detail & Related papers (2023-08-19T17:32:34Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection [3.785123406103386]
We take advantage of language prompts to introduce effective and unbiased linguistic supervision into object detection.
We propose a new mechanism called multimodal knowledge learning (MKL), which learns knowledge from language supervision.
arXiv Detail & Related papers (2022-05-09T07:03:30Z)