Pushing Boundaries: Exploring Zero Shot Object Classification with Large
Multimodal Models
- URL: http://arxiv.org/abs/2401.00127v1
- Date: Sat, 30 Dec 2023 03:19:54 GMT
- Title: Pushing Boundaries: Exploring Zero Shot Object Classification with Large
Multimodal Models
- Authors: Ashhadul Islam, Md. Rafiul Biswas, Wajdi Zaghouani, Samir Brahim
Belhaouari, Zubair Shah
- Abstract summary: Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries.
This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts.
Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymnoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
- Score: 0.09264362806173355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The synergy of language and vision models has given rise to Large Language
and Vision Assistant models (LLVAs), designed to engage users in rich
conversational experiences intertwined with image-based queries. These
comprehensive multimodal models seamlessly integrate vision encoders with Large
Language Models (LLMs), expanding their applications in general-purpose
language and visual comprehension. The advent of Large Multimodal Models (LMMs)
heralds a new era in Artificial Intelligence (AI) assistance, extending the
horizons of AI utilization. This paper takes a unique perspective on LMMs,
exploring their efficacy in performing image classification tasks using
tailored prompts designed for specific datasets. We also investigate the LLVAs'
zero-shot learning capabilities. Our study includes a benchmarking analysis
across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees),
and an unconventional dataset comprising Pox Vs. Non-Pox skin images. The
results of our experiments demonstrate the model's remarkable performance,
achieving classification accuracies of 85%, 100%, 77%, and 79% for the
respective datasets without any fine-tuning. To bolster our analysis, we assess
the model's performance post fine-tuning for specific tasks. In one instance,
fine-tuning is conducted on a dataset comprising images of faces of children
with and without autism. Prior to fine-tuning, the model demonstrated a test
accuracy of 55%, which significantly improved to 83% post fine-tuning. These
results, coupled with our prior findings, underscore the transformative
potential of LLVAs and their versatile applications in real-world scenarios.
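To make the zero-shot protocol described in the abstract concrete, the sketch below prompts a LLaVA-style LMM to pick one label from a closed set for a single image. This is a minimal sketch, assuming the Hugging Face transformers LLaVA integration with a GPU and accelerate available; the model checkpoint, prompt wording, and label-parsing heuristic are illustrative assumptions, not the paper's exact setup.

```python
# Minimal zero-shot image classification sketch with a LLaVA-style LMM.
# Assumes the Hugging Face `transformers` LLaVA integration; the model ID,
# prompt template, and label parsing are illustrative, not the paper's setup.
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # needs GPU + accelerate
)

def classify(image_path: str, labels: list[str]) -> str:
    """Ask the LMM to choose exactly one label for the image via a tailored prompt."""
    image = Image.open(image_path).convert("RGB")
    prompt = (
        "USER: <image>\n"
        f"Answer with exactly one word from {labels}: which class is shown? "
        "ASSISTANT:"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    reply = processor.decode(out[0], skip_special_tokens=True)
    answer = reply.split("ASSISTANT:")[-1].strip().lower()
    # Map the free-form reply back onto the closed label set; fall back to the
    # first label if no candidate appears in the reply.
    return next((label for label in labels if label.lower() in answer), labels[0])

# Example: a binary cats-vs-dogs query.
# print(classify("example.jpg", ["cat", "dog"]))
```

Under this reading, the benchmarking runs would instantiate the same prompt template per dataset (e.g. the ten MNIST digits, or cat/dog) and compute accuracy over the held-out test split.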
Related papers
- PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation [2.1184929769291294]
This paper presents a novel synthetic dataset designed to evaluate the proficiency of large language models in interpreting data visualizations.
Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios.
We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models.
arXiv Detail & Related papers (2024-09-04T11:19:17Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z)
- Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences.
Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.