RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
- URL: http://arxiv.org/abs/2403.13805v1
- Date: Wed, 20 Mar 2024 17:59:55 GMT
- Title: RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
- Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang,
- Abstract summary: Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
- Score: 78.97487780589574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and the 2 object detection datasets under the zero-shot recognition setting.
Related papers
- Visual RAG: Expanding MLLM visual knowledge without fine-tuning [5.341192792319891]
This paper introduces Visual RAG, that synergically combines the MLLMs capability to learn from the context, with a retrieval mechanism.
In this way, the resulting system is not limited to the knowledge extracted from the training data, but can be updated rapidly and easily without fine-tuning.
It greatly reduces the computational costs for improving the model image classification performance, and augments the model knowledge to new visual domains and tasks it was not trained for.
arXiv Detail & Related papers (2025-01-18T17:43:05Z) - Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition [1.2499537119440243]
We tackle zero shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) to classify objects based solely on descriptive attributes, excluding object class names.
We release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning.
We introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes.
arXiv Detail & Related papers (2024-12-18T15:28:08Z) - CLLMFS: A Contrastive Learning enhanced Large Language Model Framework for Few-Shot Named Entity Recognition [3.695767900907561]
CLLMFS is a Contrastive Learning enhanced Large Language Model framework for Few-Shot Named Entity Recognition.
It integrates Low-Rank Adaptation (LoRA) and contrastive learning mechanisms specifically tailored for few-shot NER.
Our method has achieved state-of-the-art performance improvements on F1-score ranging from 2.58% to 97.74% over existing best-performing methods.
arXiv Detail & Related papers (2024-08-23T04:44:05Z) - Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses knowledge of pre-trained Large Language Model to learn prompts to adapt pretrained Vision-Language Model like CLIP to multilabel classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition.
arXiv Detail & Related papers (2024-03-02T13:43:32Z) - Generative Cross-Modal Retrieval: Memorizing Images in Multimodal
Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
arXiv Detail & Related papers (2024-02-16T16:31:46Z) - LLMs as Visual Explainers: Advancing Image Classification with Evolving
Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - Waffling around for Performance: Visual Classification with Random Words
and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP)
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Semantic Representation and Dependency Learning for Multi-Label Image
Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.