Exploiting LMM-based knowledge for image classification tasks
- URL: http://arxiv.org/abs/2406.03071v1
- Date: Wed, 5 Jun 2024 08:56:24 GMT
- Title: Exploiting LMM-based knowledge for image classification tasks
- Authors: Maria Tzelepi, Vasileios Mezaris
- Abstract summary: We use the MiniGPT-4 model to extract semantic descriptions for the images.
In this paper, we propose to additionally use the text encoder to obtain the text embeddings corresponding to the MiniGPT-4-generated semantic descriptions.
The experimental evaluation on three datasets validates the improved classification performance achieved by exploiting LMM-based knowledge.
- Score: 11.801596051153725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we address image classification tasks leveraging knowledge encoded in Large Multimodal Models (LMMs). More specifically, we use the MiniGPT-4 model to extract semantic descriptions for the images, in a multimodal prompting fashion. In the current literature, vision language models such as CLIP, among other approaches, are utilized as feature extractors, using only the image encoder, for solving image classification tasks. In this paper, we propose to additionally use the text encoder to obtain the text embeddings corresponding to the MiniGPT-4-generated semantic descriptions. Thus, we use both the image and text embeddings for solving the image classification task. The experimental evaluation on three datasets validates the improved classification performance achieved by exploiting LMM-based knowledge.
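The abstract's central idea, using both the image embedding and the text embedding of an LMM-generated description for classification, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state how the two embeddings are combined, so concatenation after L2 normalization is used here, and a nearest-centroid rule stands in for whatever classifier the authors train. The arrays are random stand-ins for CLIP image-encoder outputs and for CLIP text-encoder outputs of MiniGPT-4 descriptions. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def fuse_embeddings(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Concatenate per-image image and text embeddings.

    Assumption: concatenation is only one plausible way to "use both"
    embeddings; the abstract does not specify the fusion rule.
    """
    if image_emb.shape[0] != text_emb.shape[0]:
        raise ValueError("expected exactly one text embedding per image")
    return np.concatenate([l2_normalize(image_emb), l2_normalize(text_emb)], axis=1)


def fit_centroids(features: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Compute one mean feature vector per class (stand-in classifier)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])


def predict(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each sample to the class with the nearest centroid."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


# Synthetic stand-ins for real encoder outputs (4 images, 512-dim embeddings).
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
image_emb = rng.normal(size=(4, 512))  # would come from the CLIP image encoder
text_emb = rng.normal(size=(4, 512))   # would come from the CLIP text encoder

fused = fuse_embeddings(image_emb, text_emb)
centroids = fit_centroids(fused, labels, num_classes=2)
preds = predict(fused, centroids)
```

In a real pipeline the two input arrays would be replaced by actual encoder outputs, and the centroid rule by a trained classification head. The fusion step itself would stay the same under the concatenation assumption.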
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- LMM-Regularized CLIP Embeddings for Image Classification [11.801596051153725]
We deal with image classification tasks using the powerful CLIP vision-language model.
We propose a novel Large Multimodal Model (LMM) based regularization method.
In this way, it produces embeddings with enhanced discrimination ability.
arXiv Detail & Related papers (2024-12-16T11:11:23Z)
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of image, including text-rich images.
Current benchmarks fail to accurately reflect performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation [12.994898879803642]
The CLIP-Decoder is a novel method based on the state-of-the-art ML-Decoder attention-based head.
We introduce multi-modal representation learning in CLIP-Decoder, utilizing the text encoder to extract text features and the image encoder for image feature extraction.
Our method achieves an absolute increase of 3.9% in performance compared to existing methods for zero-shot learning multi-label classification tasks.
arXiv Detail & Related papers (2024-06-21T02:19:26Z)
- Disturbing Image Detection Using LMM-Elicited Emotion Embeddings [11.801596051153725]
We deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs).
We propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions.
The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
arXiv Detail & Related papers (2024-06-18T14:41:04Z)
- Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses the knowledge of a pre-trained Large Language Model to learn prompts that adapt a pre-trained Vision-Language Model such as CLIP to multi-label classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition.
arXiv Detail & Related papers (2024-03-02T13:43:32Z)
- I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification [108.83932812826521]
Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks.
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views.
I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
arXiv Detail & Related papers (2022-12-05T14:11:36Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features [8.317191999275536]
In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks.
In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities.
arXiv Detail & Related papers (2020-01-14T12:06:12Z)