Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
- URL: http://arxiv.org/abs/2306.06094v2
- Date: Thu, 11 Jul 2024 17:59:53 GMT
- Title: Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
- Authors: Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee,
- Abstract summary: Large language models (LLMs) have made significant advancements in natural language understanding.
This work investigates if it is possible for the LLM to understand images as well.
- Score: 46.042197741423365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have made significant advancements in natural language understanding. However, through that enormous semantic representation that the LLM has learnt, is it somehow possible for it to understand images as well? This work investigates this question. To enable the LLM to process images, we convert them into a representation given by Scalable Vector Graphics (SVG). To study what the LLM can do with this XML-based textual description of images, we test the LLM on three broad computer vision tasks: (i) visual reasoning and question answering, (ii) image classification under distribution shift, few-shot learning, and (iii) generating new images using visual prompting. Even though we do not naturally associate LLMs with any visual understanding capabilities, our results indicate that the LLM can often do a decent job in many of these tasks, potentially opening new avenues for research into LLMs' ability to understand image data. Our code, data, and models can be found here https://github.com/mu-cai/svg-llm.
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a foreign language'' with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z) - CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted this way.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z) - Filling the Image Information Gap for VQA: Prompting Large Language
Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and the maintenance of world knowledge.
As images are invisible to LLMs, researchers convert images to text to engage LLMs into the visual question reasoning procedure.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
arXiv Detail & Related papers (2023-11-20T08:23:39Z) - Frozen Transformers in Language Models Are Effective Visual Encoder Layers [26.759544759745648]
Large language models (LLMs) are surprisingly strong encoders for purely visual tasks in the absence of language.
Our work pushes the boundaries of leveraging LLMs for computer vision tasks.
We propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding.
arXiv Detail & Related papers (2023-10-19T17:59:05Z) - SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen
LLMs [124.29233620842462]
We introduce SPAE for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
The resulting lexical tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction.
Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
arXiv Detail & Related papers (2023-06-30T17:59:07Z) - LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks.
arXiv Detail & Related papers (2023-05-19T07:44:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.