The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge
- URL: http://arxiv.org/abs/2406.12225v1
- Date: Tue, 18 Jun 2024 03:03:02 GMT
- Title: The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge
- Authors: Hongpeng Pan, Shifeng Yi, Shouwei Yang, Lei Qi, Bing Hu, Yi Xu, Yang Yang,
- Abstract summary: This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection.
We propose the VLM+ framework, which integrates the multimodal large language model (MM-LLM)
We use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM.
- Score: 14.330962576584446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection. However, on specific datasets, VLM may encounter the problem where the detected targets are misaligned with the target concepts of interest. This misalignment hinders the zero-shot performance of VLM and the application of fine-tuning methods based on pseudo-labels. To address this issue, we propose the VLM+ framework, which integrates the multimodal large language model (MM-LLM). Specifically, we use MM-LLM to generate a series of referential expressions for each category. Based on the VLM predictions and the given annotations, we select the best referential expression for each category by matching the maximum IoU. Subsequently, we use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM. Additionally, we employ iterative pseudo-label generation and optimization to further enhance the performance of the VLM. Our approach achieve 32.56 mAP in the final test.
Related papers
- Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios.
Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records.
Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z) - GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models [44.82179903133343]
Large Language Models (LLMs) act as implicits for Vision-Langugage Models (VLMs)
Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts.
We evaluate our GLOV on 16 diverse datasets using two families of VLMs.
arXiv Detail & Related papers (2024-10-08T15:55:40Z) - MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context.
Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets.
Our method outperforms the other state-of-the-arts by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL)
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.