Related papers: The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge

The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge

URL: http://arxiv.org/abs/2406.12225v1
Date: Tue, 18 Jun 2024 03:03:02 GMT
Title: The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge
Authors: Hongpeng Pan, Shifeng Yi, Shouwei Yang, Lei Qi, Bing Hu, Yi Xu, Yang Yang,
Abstract summary: This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection. We propose the VLM+ framework, which integrates the multimodal large language model (MM-LLM) We use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM.
Score: 14.330962576584446
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection. However, on specific datasets, VLM may encounter the problem where the detected targets are misaligned with the target concepts of interest. This misalignment hinders the zero-shot performance of VLM and the application of fine-tuning methods based on pseudo-labels. To address this issue, we propose the VLM+ framework, which integrates the multimodal large language model (MM-LLM). Specifically, we use MM-LLM to generate a series of referential expressions for each category. Based on the VLM predictions and the given annotations, we select the best referential expression for each category by matching the maximum IoU. Subsequently, we use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM. Additionally, we employ iterative pseudo-label generation and optimization to further enhance the performance of the VLM. Our approach achieve 32.56 mAP in the final test.

Related papers

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning. We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector.
arXiv Detail & Related papers (2025-03-26T12:42:37Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks [48.67303250592189]
This paper proposes a novel paradigm to select and reuse VLM for downstream tasks, called Model Label Learning (MLL) The proposal is highly computationally efficient and growable since the model labeling process is completed target task independent.
arXiv Detail & Related papers (2025-01-30T11:10:46Z)
Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios. Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records. Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z)
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models [44.82179903133343]
Large Language Models (LLMs) act as implicits for Vision-Langugage Models (VLMs) Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts. We evaluate our GLOV on 16 diverse datasets using two families of VLMs.
arXiv Detail & Related papers (2024-10-08T15:55:40Z)
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context. Our observations lead to a simple yet effective paradigm, coded MarvelOVD, that generates significantly better training targets. Our method outperforms the other state-of-the-arts by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL) Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning. We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo. We analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection. We propose VLM Selection With gAp Bridging to mitigate the negative impact of two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal. We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.