Few-Shot Classification & Segmentation Using Large Language Models Agent
- URL: http://arxiv.org/abs/2311.12065v1
- Date: Sun, 19 Nov 2023 00:33:41 GMT
- Title: Few-Shot Classification & Segmentation Using Large Language Models Agent
- Authors: Tian Meng, Yang Tao, Wuliang Yin
- Abstract summary: We introduce a method that utilises large language models (LLM) as an agent to address the FS-CS problem in a training-free manner.
Our approach achieves state-of-the-art performance on the Pascal-5i dataset.
- Score: 0.7550566004119158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of few-shot image classification and segmentation (FS-CS) requires
the classification and segmentation of target objects in a query image, given
only a few examples of the target classes. We introduce a method that utilises
a large language model (LLM) as an agent to address the FS-CS problem in a
training-free manner. By making the LLM the task planner and off-the-shelf
vision models the tools, the proposed method is capable of classifying and
segmenting target objects using only image-level labels. Specifically,
chain-of-thought prompting and in-context learning guide the LLM to observe
support images as a human would; vision models such as the Segment Anything
Model (SAM) and GPT-4Vision help the LLM capture spatial and semantic
information simultaneously. Ultimately, the LLM uses its summarizing and reasoning capabilities
to classify and segment the query image. The proposed method's modular
framework makes it easily extendable. Our approach achieves state-of-the-art
performance on the Pascal-5i dataset.
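No code accompanies this listing, so the following is a minimal, hypothetical sketch of the planner-plus-tools loop the abstract describes. The wrappers `llm_plan`, `gpt4v_describe`, and `sam_segment` are assumed stand-ins for an LLM API, GPT-4Vision, and SAM, not the paper's actual interfaces.

```python
# Minimal sketch of the LLM-as-agent pattern described in the abstract.
# All callables (llm_plan, gpt4v_describe, sam_segment) are hypothetical
# wrappers around an LLM API, GPT-4Vision, and the Segment Anything Model.

def fs_cs_agent(support_images, support_labels, query_image,
                llm_plan, gpt4v_describe, sam_segment):
    """Classify and segment a query image from a few labelled support images."""
    # 1. In-context learning: a vision-language model describes each support
    #    image so the LLM can "observe" the support set as text.
    observations = [
        f"Class '{label}': {gpt4v_describe(img)}"
        for img, label in zip(support_images, support_labels)
    ]

    # 2. Chain-of-thought planning: the LLM summarises what distinguishes
    #    each target class and decides what to look for in the query.
    plan = llm_plan(
        "You are a task planner for few-shot classification and segmentation.\n"
        "Support observations:\n" + "\n".join(observations) +
        "\nThink step by step, then list the classes likely present "
        "in the query image."
    )

    # 3. Tool use: SAM proposes class-agnostic masks; GPT-4Vision describes
    #    each masked region so the LLM can match regions to classes.
    masks = sam_segment(query_image)
    region_reports = [gpt4v_describe(query_image, mask=m) for m in masks]

    # 4. Reasoning/summarising: the LLM assigns a class (or 'background')
    #    to every proposed region, yielding classification + segmentation.
    decision = llm_plan(
        f"Plan:\n{plan}\nRegion descriptions:\n" +
        "\n".join(f"region {i}: {r}" for i, r in enumerate(region_reports)) +
        "\nFor each region, answer with its class or 'background'."
    )
    return masks, decision
```

The key design choice the abstract highlights is that all learning happens in context: the LLM only ever sees text, so any vision model can be swapped in as a tool, which is what makes the framework training-free and easily extendable.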
Related papers
- SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images.
We first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) with the reasoning capability of Large Language Models (LLMs) within a modular framework.
arXiv Detail & Related papers (2024-10-28T18:10:26Z)
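Reading the SocialGPT summary above, the modular split is perception (VFMs produce text) followed by reasoning (an LLM classifies the relation). A minimal sketch under that reading, with hypothetical `vfm_to_text` and `llm_reason` wrappers:

```python
# Hypothetical sketch of a perceive-then-reason pipeline in the spirit of
# SocialGPT: a Vision Foundation Model converts the image into a textual
# description, and an LLM infers the social relation from that text.

def social_relation(image, people, vfm_to_text, llm_reason):
    # Perception: a VFM (e.g. captioner + detector) produces a text description
    # of the image focused on the people of interest.
    story = vfm_to_text(image, regions_of_interest=people)
    # Reasoning: the LLM picks a relation category from the description alone.
    prompt = (
        "Description of the people in the image:\n" + story +
        "\nWhat is their social relation "
        "(e.g. friends, spouses, colleagues)? Explain briefly, then answer."
    )
    return llm_reason(prompt)
```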
- LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning [8.379286663107845]
Reasoning segmentation is a novel task that enables a segmentation system to reason about and interpret implicit user intentions.
Our work on reasoning segmentation contributes to both the methodological design and the dataset labeling.
arXiv Detail & Related papers (2024-04-12T18:45:51Z)
- Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models [0.6149772262764599]
We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method, which transforms the FS-CS problem into a Visual Question Answering (VQA) problem.
Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
arXiv Detail & Related papers (2024-03-15T13:29:41Z)
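The VISE summary above describes recasting FS-CS as VQA. A minimal sketch of the classification half of that reframing, assuming a hypothetical `vlm_answer(images, question)` wrapper around any instruction-following vision-language model:

```python
# Sketch of recasting few-shot classification as VQA, following the idea
# summarised above. `vlm_answer` is a hypothetical wrapper around an
# instruction-following vision-language model.

def fs_cs_as_vqa(support_images, class_names, query_image, vlm_answer):
    # support_images: list of per-class example-image lists, aligned
    # with class_names.
    present = []
    for name, examples in zip(class_names, support_images):
        # Classification as a yes/no visual question, conditioned on the
        # support examples of each class.
        question = (f"The first images show examples of '{name}'. "
                    f"Does the last image contain a '{name}'? Answer yes or no.")
        if vlm_answer(examples + [query_image], question).lower().startswith("yes"):
            present.append(name)
    return present  # segmentation could be asked for analogously, per class
```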
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
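The planner/caller/summarizer decomposition above maps naturally onto three cooperating models. A minimal, hypothetical sketch follows; the three callables stand in for separately served (possibly small) LLMs, and the plan format is an assumption:

```python
# Sketch of the planner / caller / summarizer decomposition described above.
# Each role can be served by a different (possibly small) LLM; the three
# callables here are hypothetical wrappers around such models.

def multi_llm_agent(task, planner, caller, summarizer, tools):
    # Planner: break the task into tool-call steps. Assumed to return an
    # iterable of (tool_name, tool_args) pairs.
    plan = planner(f"Break this task into tool calls: {task}\n"
                   f"Available tools: {list(tools)}")
    results = []
    for tool_name, tool_args in plan:
        # Caller: turn each step into a concrete, well-formed tool invocation.
        results.append(caller(tools[tool_name], tool_args))
    # Summarizer: condense the tool outputs into the final answer.
    return summarizer(f"Task: {task}\nTool results: {results}")
```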
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art multimodal LLMs (mLLMs) by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
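CLAMP's title points to contrastive prompt-tuning; sketched below is a generic CLIP-style symmetric contrastive loss of the kind such adaptation typically optimises. This is an assumed illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

# Generic CLIP-style contrastive image-text alignment loss, illustrating the
# kind of objective suggested by CLAMP's title (prompt-tuned LLM text features
# vs. image features). Not the paper's exact formulation.

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    # image_feats: (N, D) from an image encoder
    # text_feats:  (N, D) from the (prompt-tuned) LLM for the matching labels
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric InfoNCE: match each image to its text and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```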
- LLaFS: When Large Language Models Meet Few-Shot Segmentation [32.86287519276783]
We propose LLaFS, the first attempt to leverage large language models (LLMs) in few-shot segmentation.
In contrast to conventional few-shot segmentation methods, which rely only on the limited and biased information from the annotated support images, LLaFS uses the LLM to segment images directly in a few-shot manner.
LLaFS achieves state-of-the-art results on multiple datasets, showing the potential of using LLMs for few-shot computer vision tasks.
arXiv Detail & Related papers (2023-11-28T16:31:27Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
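The RefSAM summary above mentions a lightweight cross-modal module that projects language features for SAM. A minimal sketch of such a projection MLP, with assumed dimensions:

```python
import torch.nn as nn

# Sketch of a lightweight cross-modal MLP in the spirit of RefSAM: project a
# sentence embedding of the referring expression into SAM's prompt-embedding
# space so it can steer the mask decoder. Dimensions are assumptions.

class CrossModalMLP(nn.Module):
    def __init__(self, text_dim=768, prompt_dim=256, hidden=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, prompt_dim),
        )

    def forward(self, text_embedding):
        # (B, text_dim) -> (B, prompt_dim): usable as a sparse prompt token.
        return self.proj(text_embedding)
```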
- I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification [108.83932812826521]
Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks.
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views.
I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
arXiv Detail & Related papers (2022-12-05T14:11:36Z)
- Integrative Few-Shot Learning for Classification and Segmentation [37.50821005917126]
We introduce the integrative task of few-shot classification and segmentation (FS-CS).
FS-CS aims to classify and segment target objects in a query image when the target classes are given with a few examples.
We propose the integrative few-shot learning framework for FS-CS, which trains a learner to construct class-wise foreground maps.
arXiv Detail & Related papers (2022-03-29T16:14:40Z)
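The class-wise foreground maps mentioned above are the hinge of the integrative framework: one per-class map can answer both sub-tasks. A minimal sketch of that decision logic (thresholds are assumptions; the maps themselves would come from the trained learner):

```python
import torch

# Sketch: how a single class-wise foreground map can answer both FS-CS
# sub-tasks, as the summary describes. `foreground_maps` would come from a
# trained model; only the decision logic is shown here.

def classify_and_segment(foreground_maps, cls_threshold=0.5, seg_threshold=0.5):
    # foreground_maps: (num_classes, H, W) with values in [0, 1]
    # Classification: a class is present if its map is active anywhere.
    class_scores = foreground_maps.flatten(1).max(dim=1).values  # (num_classes,)
    present = class_scores > cls_threshold
    # Segmentation: threshold each present class's map into a binary mask.
    masks = (foreground_maps > seg_threshold) & present[:, None, None]
    return present, masks
```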
- Learning Meta-class Memory for Few-Shot Semantic Segmentation [90.28474742651422]
We introduce the concept of meta-class, which is the meta information shareable among all classes.
We propose a novel Meta-class Memory based few-shot segmentation method (MM-Net), where we introduce a set of learnable memory embeddings.
Our proposed MM-Net achieves 37.5% mIoU on the COCO dataset in the 1-shot setting, which is 5.1% higher than the previous state-of-the-art.
arXiv Detail & Related papers (2021-08-06T06:29:59Z)
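The meta-class memory above is described as a set of learnable embeddings shared across all classes. A minimal sketch of one plausible readout, attending query features over the memory bank (sizes and the attention readout are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

# Sketch of a learnable meta-class memory in the spirit of MM-Net: a small
# bank of embeddings shared across all classes, read via attention by the
# query features. Sizes and the readout are assumptions.

class MetaClassMemory(nn.Module):
    def __init__(self, num_slots=32, dim=256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))

    def forward(self, query_feats):
        # query_feats: (B, HW, dim). Attend each location over memory slots.
        attn = torch.softmax(query_feats @ self.memory.t(), dim=-1)  # (B, HW, S)
        return attn @ self.memory  # (B, HW, dim) memory-enhanced features
```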