VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
- URL: http://arxiv.org/abs/2407.10972v1
- Date: Mon, 15 Jul 2024 17:59:55 GMT
- Title: VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
- Authors: Bocheng Zou, Mu Cai, Jianrui Zhang, Yong Jae Lee
- Abstract summary: VGBench is a comprehensive benchmark for Large Language Models (LLMs) on handling vector graphics.
LLMs show strong capability on both understanding and generation while exhibiting less desirable performance on low-level formats (SVG).
- Score: 28.1277394934428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons or sketches. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced at https://vgbench.github.io.
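To make the described setup concrete, the sketch below shows the kind of zero-shot vector-graphics understanding query the abstract refers to: an LLM receives raw SVG source as text and answers a multiple-choice question about the depicted content. This is a minimal illustration assuming the OpenAI Python client; the SVG snippet, question wording, and model name are hypothetical and not drawn from the released VGBench data or prompts.

```python
# Minimal sketch of a zero-shot SVG understanding query (illustrative only).
# Assumes the OpenAI Python client (pip install openai) and an API key in
# the OPENAI_API_KEY environment variable.
from openai import OpenAI

svg_source = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="red"/>
  <rect x="30" y="30" width="40" height="40" fill="blue"/>
</svg>"""

question = (
    "What shape is drawn on top of the red circle?\n"
    "(A) a blue square  (B) a green triangle  (C) a yellow line  (D) nothing"
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # hypothetical choice; any chat-capable LLM could be used
    messages=[
        {"role": "system", "content": "Answer with a single option letter."},
        {"role": "user", "content": f"SVG code:\n{svg_source}\n\n{question}"},
    ],
)
print(response.choices[0].message.content)  # expected answer here: "A"
```

A generation query would work the same way in reverse: the prompt gives a textual description (e.g. "a blue square centered on a red circle") and asks the model to emit valid SVG, which can then be rendered and scored.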
Related papers
- How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
arXiv Detail & Related papers (2024-08-07T17:59:40Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Re-Thinking Inverse Graphics With Large Language Models [51.333105116400205]
Inverse graphics -- inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics.
We propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM.
We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training.
arXiv Detail & Related papers (2024-04-23T16:59:02Z)
- Visually Descriptive Language Model for Vector Graphics Reasoning [76.42082386029206]
We propose the Visually Descriptive Language Model (VDLM) to bridge the gap between low-level visual perception and high-level language reasoning.
We show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks.
arXiv Detail & Related papers (2024-04-09T17:30:18Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis [112.25071764647683]
StrokeNUWA is a pioneering work exploring a better visual representation, "stroke tokens", for vector graphics.
Equipped with stroke tokens, StrokeNUWA can significantly surpass traditional LLM-based and optimization-based methods.
StrokeNUWA achieves up to a 94x speedup in inference over the speed of prior methods with an exceptional SVG code compression ratio of 6.9%.
arXiv Detail & Related papers (2024-01-30T15:20:26Z)
- Im2Vec: Synthesizing Vector Graphics without Vector Supervision [31.074606918245298]
Vector graphics are widely used to represent fonts, logos, digital artworks, and graphic designs.
One can always rasterize the input graphic and resort to image-based generative approaches.
Current models require explicit supervision on the vector representation at training time, which is difficult to obtain.
We propose a new neural network that can generate complex vector graphics with varying topologies.
arXiv Detail & Related papers (2021-02-04T18:39:45Z)