Large Model Based Referring Camouflaged Object Detection
- URL: http://arxiv.org/abs/2311.17122v1
- Date: Tue, 28 Nov 2023 13:45:09 GMT
- Title: Large Model Based Referring Camouflaged Object Detection
- Authors: Shupeng Cheng, Ge-Peng Ji, Pengda Qin, Deng-Ping Fan, Bowen Zhou, Peng Xu
- Abstract summary: Referring camouflaged object detection (Ref-COD) is a recently-proposed problem aiming to segment out specified camouflaged objects matched with a textual or visual reference.
Our motivation is to make full use of the semantic intelligence and intrinsic knowledge of recent Multimodal Large Language Models (MLLMs) to decompose this complex task in a human-like way.
We propose a large-model-based Multi-Level Knowledge-Guided multimodal method for Ref-COD termed MLKG.
- Score: 51.80619142347807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring camouflaged object detection (Ref-COD) is a recently-proposed
problem aiming to segment out specified camouflaged objects matched with a
textual or visual reference. This task involves two major challenges: the COD
domain-specific perception and multimodal reference-image alignment. Our
motivation is to make full use of the semantic intelligence and intrinsic
knowledge of recent Multimodal Large Language Models (MLLMs) to decompose this
complex task in a human-like way. As language is highly condensed and inductive, linguistic expression is the main medium of human knowledge learning, and the transmission of knowledge follows a multi-level progression from simple to complex. In this paper, we propose a large-model-based Multi-Level Knowledge-Guided multimodal method for Ref-COD, termed MLKG, in which multi-level knowledge descriptions from an MLLM are organized to guide a large vision segmentation model to perceive the camouflaged targets and the camouflage scene progressively, while deeply aligning the textual references with the camouflaged images. To our knowledge, our main contributions are: (1) this is the first study of MLLM knowledge for Ref-COD and COD; (2) we are the first to propose decomposing Ref-COD into the two perspectives of perceiving the target and the scene by integrating MLLM knowledge, and we contribute a multi-level knowledge-guided method; (3) our method achieves the state of the art on the Ref-COD benchmark, outperforming numerous strong competitors. Moreover, thanks to the injected rich knowledge, it
demonstrates zero-shot generalization ability on uni-modal COD datasets. We
will release our code soon.
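The abstract stays at a high level and the authors' code is not yet released, so the following PyTorch sketch is purely illustrative of the general idea of multi-level knowledge guidance: text embeddings for each knowledge level (e.g. scene-level and target-level descriptions produced by an MLLM) are injected into flattened visual features through cross-attention, level by level, before a lightweight mask head. All module names, dimensions, and the number of levels are assumptions, not the MLKG implementation.

```python
# Hypothetical sketch (not the authors' MLKG code): multi-level textual
# knowledge embeddings progressively guide visual features before a mask head.
from typing import List

import torch
import torch.nn as nn


class KnowledgeGuidedSegHead(nn.Module):
    """Fuses several levels of knowledge embeddings into image features, coarse to fine."""

    def __init__(self, dim: int = 256, num_levels: int = 3, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per knowledge level (e.g. scene -> target).
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_levels)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_levels)])
        self.mask_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_tokens: torch.Tensor, knowledge: List[torch.Tensor]) -> torch.Tensor:
        # img_tokens: (B, H*W, dim) flattened features from a vision backbone.
        # knowledge:  one (B, T_l, dim) tensor per level, embedded MLLM text.
        x = img_tokens
        for attn, norm, k in zip(self.cross_attn, self.norms, knowledge):
            guided, _ = attn(query=x, key=k, value=k)  # inject level-l knowledge
            x = norm(x + guided)                       # residual, progressive refinement
        return self.mask_head(x).squeeze(-1)           # (B, H*W) mask logits


if __name__ == "__main__":
    head = KnowledgeGuidedSegHead(dim=256, num_levels=3)
    img = torch.randn(2, 32 * 32, 256)
    know = [torch.randn(2, 16, 256) for _ in range(3)]  # placeholder MLLM embeddings
    print(head(img, know).shape)  # torch.Size([2, 1024])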
Related papers
- EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training effort than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z)
- A Survey of Camouflaged Object Detection and Beyond [19.418542867580374]
Camouflaged Object Detection (COD) refers to the task of identifying and segmenting objects that blend seamlessly into their surroundings.
In recent years, COD has garnered widespread attention due to its potential applications in surveillance, wildlife conservation, autonomous systems, and more.
This paper explores various COD methods across four domains, including both image-level and video-level solutions.
arXiv Detail & Related papers (2024-08-26T18:23:22Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
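Purely as an illustration of that three-step knowledge-plugin pipeline, a minimal sketch is given below; the DOKE paper's actual interfaces are not shown on this page, so every name and prompt format here is a hypothetical placeholder.

```python
# Hypothetical sketch of a DOKE-style knowledge plugin: prepare a knowledge
# pool, select sample-relevant items, and express them in the LLM prompt.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KnowledgePlugin:
    prepare: Callable[[str], List[str]]            # 1) build a task-level knowledge pool
    select: Callable[[str, List[str]], List[str]]  # 2) pick knowledge for one sample
    express: Callable[[List[str]], str]            # 3) verbalize it for the LLM

    def augment(self, task: str, sample: str) -> str:
        pool = self.prepare(task)
        picked = self.select(sample, pool)
        return f"{sample}\n\nRelevant domain knowledge:\n{self.express(picked)}"


if __name__ == "__main__":
    plugin = KnowledgePlugin(
        prepare=lambda task: [
            "item A is frequently bought together with item B",
            "item C is a seasonal product",
        ],
        # toy selector: keep knowledge whose subject appears in the sample
        select=lambda sample, pool: [k for k in pool if k.split(" is")[0] in sample],
        express=lambda ks: "\n".join(f"- {k}" for k in ks),
    )
    prompt = plugin.augment("recommendation", "User history: item A, item D. What next?")
    print(prompt)  # the augmented prompt is then sent to an unchanged LLM
```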
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
Diverse multi-modal masked language modeling is realized via an object divergence constraint imposed on traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
- Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection [3.785123406103386]
We take advantage of language prompts to introduce effective and unbiased linguistic supervision into object detection.
We propose a new mechanism called multimodal knowledge learning (MKL), which learns knowledge from language supervision.
arXiv Detail & Related papers (2022-05-09T07:03:30Z)
- Towards Accurate Camouflaged Object Detection with Mixture Convolution and Interactive Fusion [45.45231015502287]
We propose a novel deep-learning-based COD approach that integrates a large receptive field and effective feature fusion into a unified framework.
Our method detects camouflaged objects with an effective fusion strategy that aggregates rich context information from a large receptive field.
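As a rough illustration of the "large receptive field plus fusion" idea, the sketch below uses parallel dilated convolutions whose outputs are fused by a 1x1 convolution; the branch choices and dilation rates are assumptions, not the exact module from the paper above.

```python
# Hypothetical mixture-convolution-style block: parallel branches with
# different dilation rates enlarge the receptive field; a 1x1 conv fuses them.
import torch
import torch.nn as nn


class MixtureConvBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d) for d in dilations]
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees a different receptive field; concat + 1x1 conv fuses them.
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(x + self.fuse(ctx))  # residual connection keeps fine detail


if __name__ == "__main__":
    block = MixtureConvBlock(64)
    print(block(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```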
arXiv Detail & Related papers (2021-01-14T16:06:08Z)