Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
- URL: http://arxiv.org/abs/2410.22456v1
- Date: Tue, 29 Oct 2024 18:44:59 GMT
- Title: Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
- Authors: Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, Percy Liang
- Abstract summary: Image2Struct is a benchmark to evaluate vision-language models (VLMs) on extracting structure from images.
In Image2Struct, VLMs are prompted to generate the underlying structure from an input image.
The structure is then rendered to produce an output image, which is compared against the input image to produce a similarity score.
- Score: 57.531922659664296
- License:
- Abstract: We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.
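The round-trip protocol can be illustrated with a minimal sketch (illustrative only, not the benchmark's actual pipeline): the model emits structure, the structure is rendered back to an image, and the two images are compared. In the sketch below, `generate_structure` and `render` are hypothetical placeholders for a VLM call and a LaTeX/HTML renderer, and only two of the five metrics are shown.

```python
# Minimal sketch of the round-trip idea (not the official Image2Struct pipeline).
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def pixel_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus the mean absolute pixel difference, on images scaled to [0, 1]."""
    return 1.0 - float(np.mean(np.abs(a - b)))

def round_trip_score(input_png: str, rendered_png: str) -> dict:
    size = (512, 512)  # resize both images so they are directly comparable
    a = np.asarray(Image.open(input_png).convert("L").resize(size), dtype=np.float64) / 255.0
    b = np.asarray(Image.open(rendered_png).convert("L").resize(size), dtype=np.float64) / 255.0
    return {
        "pixel_similarity": pixel_similarity(a, b),
        "ssim": float(structural_similarity(a, b, data_range=1.0)),
    }

# Hypothetical usage:
# structure = generate_structure("equation.png")   # VLM produces LaTeX/HTML
# render(structure, "rendered.png")                # compile/render it back to an image
# print(round_trip_score("equation.png", "rendered.png"))
```

The remaining metrics named in the abstract (Inception-vector cosine similarity, LPIPS, earth mover similarity) require learned feature extractors and are omitted from this sketch.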
Related papers
- VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin).
VisMin requires models to predict the correct image-caption match given two images and two captions (a minimal scoring sketch follows this entry).
We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
arXiv Detail & Related papers (2024-07-23T18:10:43Z)
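A minimal sketch of the two-image/two-caption matching decision, assuming a hypothetical CLIP-like encoder exposed as `embed_image` and `embed_text` (VisMin's own evaluation protocol may differ):

```python
# Illustrative sketch of a VisMin-style matching decision (assumed scoring rule,
# not necessarily the benchmark's official protocol).
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def identity_pairing_wins(images, captions, embed_image, embed_text) -> bool:
    """True if pairing (image0, caption0) and (image1, caption1) scores higher
    than the crossed pairing (image0, caption1) and (image1, caption0)."""
    im = [embed_image(x) for x in images]      # embed_image: hypothetical encoder
    tx = [embed_text(c) for c in captions]     # embed_text: hypothetical encoder
    identity = cosine(im[0], tx[0]) + cosine(im[1], tx[1])
    crossed = cosine(im[0], tx[1]) + cosine(im[1], tx[0])
    return identity > crossed
```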
- Doppelgangers: Learning to Disambiguate Images of Similar Structures [76.61267007774089]
Illusory image matches can be challenging for humans to differentiate, and can lead 3D reconstruction algorithms to produce erroneous results.
We propose a learning-based approach to visual disambiguation, formulating it as a binary classification task on image pairs.
Our evaluation shows that our method can distinguish illusory matches in difficult cases, and can be integrated into SfM pipelines to produce correct, disambiguated 3D reconstructions.
arXiv Detail & Related papers (2023-09-05T17:50:36Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Robust Graph Structure Learning over Images via Multiple Statistical Tests [25.97631995863608]
A natural way to construct a graph among images is to treat each image as a node and assign pairwise image similarities as weights to corresponding edges.
It is well known that pairwise similarities between images are sensitive to the noise in feature representations, leading to unreliable graph structures.
By viewing the feature vector of each node as an independent sample, the decision of whether to create an edge between two nodes based on their similarity in feature representation can be thought of as a single statistical test (see the sketch below).
arXiv Detail & Related papers (2022-10-08T07:56:13Z)
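A simplified sketch of the edge-construction step described above, where each candidate edge is kept only if its cosine similarity passes a threshold; the paper's actual multiple-testing procedure is not reproduced here.

```python
# Simplified sketch: build a graph over image features by thresholding pairwise
# cosine similarities. The per-edge threshold stands in for a statistical test;
# the paper's multiple-testing procedure is not reproduced here.
import numpy as np

def similarity_graph(features: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """features: (n, d) array of image feature vectors.
    Returns an (n, n) weighted adjacency matrix."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                  # pairwise cosine similarities
    adj = np.where(sim >= threshold, sim, 0.0)     # keep only confident edges
    np.fill_diagonal(adj, 0.0)                     # no self-loops
    return adj
```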
- Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT shows how utilizing the structure of a small number of images only available during training can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z)
- L2C: Describing Visual Differences Needs Semantic Understanding of Individuals [65.87728481187625]
We introduce a Learning-to-Compare model, which learns to understand the semantic structures of two images and compare them while learning to describe each one.
We demonstrate that L2C benefits from a comparison between explicit semantic representations and single-image captions, and generalizes better on the new testing image pairs.
arXiv Detail & Related papers (2021-02-03T03:44:42Z)
- Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition [98.10703825716142]
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
We present different architectures based on PyConv for four main tasks in visual recognition: image classification, video action classification/recognition, object detection, and semantic image segmentation/parsing (a simplified sketch follows this entry).
arXiv Detail & Related papers (2020-06-20T10:19:29Z)
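A simplified multi-branch convolution in the spirit of PyConv, assuming PyTorch; the actual PyConv additionally varies convolution group sizes per kernel level, so this sketch only captures the multiple-filter-scale idea.

```python
# Simplified multi-scale convolution in the spirit of PyConv (illustrative only;
# the real PyConv also varies convolution group sizes per kernel level).
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        per_branch = out_ch // len(kernel_sizes)   # split output channels across scales
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, per_branch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the same input with a different receptive field;
        # concatenating along channels mixes the scales.
        return torch.cat([b(x) for b in self.branches], dim=1)

# y = MultiScaleConv(64, 128)(torch.randn(1, 64, 32, 32))  # -> (1, 128, 32, 32)
```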
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss (sketched below).
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
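A minimal sketch of a hinge-based triplet ranking loss over embedding vectors; the margin value and the choice of negatives below are assumptions, not the paper's exact settings.

```python
# Sketch of a hinge-based triplet ranking loss over embeddings (illustrative;
# the margin and the sampling of negatives are assumptions).
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_ranking_loss(anchor, positive, negative, margin: float = 0.2) -> float:
    """Push the matching (anchor, positive) pair to score higher than the
    mismatched (anchor, negative) pair by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```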
This list is automatically generated from the titles and abstracts of the papers on this site.