SO-Bench: A Structural Output Evaluation of Multimodal LLMs
- URL: http://arxiv.org/abs/2511.21750v1
- Date: Sun, 23 Nov 2025 16:53:16 GMT
- Title: SO-Bench: A Structural Output Evaluation of Multimodal LLMs
- Authors: Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
- Abstract summary: Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. We conduct a comprehensive study of the visual structural output capabilities of MLLMs with our carefully designed SO-Bench benchmark.
- Score: 43.74273224757814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of the visual structural output capabilities of MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in producing accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We plan to make the benchmark available to the community.
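To make the schema-grounding setup concrete, below is a minimal sketch of how a model's raw text output could be checked for schema compliance with the standard `jsonschema` library. The chart schema, the hard-coded model output, and the `check_structured_output` helper are illustrative assumptions, not SO-Bench's actual schemas or evaluation harness.

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema for a chart-extraction task; SO-Bench's real schemas differ.
CHART_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "x_axis": {"type": "string"},
        "series": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string"},
                    "values": {"type": "array", "items": {"type": "number"}},
                },
                "required": ["label", "values"],
            },
        },
    },
    "required": ["title", "series"],
}


def check_structured_output(raw_output: str, schema: dict) -> tuple[bool, str]:
    """Return (is_valid, reason): parse the model's text as JSON,
    then validate the parsed object against the target schema."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return False, f"not valid JSON: {e}"
    try:
        validate(instance=parsed, schema=schema)
    except ValidationError as e:
        return False, f"schema violation: {e.message}"
    return True, "ok"


if __name__ == "__main__":
    # In a real pipeline this string would come from an MLLM prompted with
    # an image and the schema; here it is hard-coded for illustration.
    model_output = (
        '{"title": "Monthly sales", '
        '"series": [{"label": "2024", "values": [1, 2, 3]}]}'
    )
    ok, reason = check_structured_output(model_output, CHART_SCHEMA)
    print(ok, reason)
```

Schema compliance of this kind is only the first hurdle the abstract points to; whether the extracted fields are accurate with respect to the image still has to be scored separately.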
Related papers
- LLMStructBench: Benchmarking Large Language Model Structured Data Extraction [1.338174941551702]
We present a novel benchmark for evaluating Large Language Models (LLMs). Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity. We show that choosing the right prompting strategy is more important than standard attributes such as model size.
arXiv Detail & Related papers (2026-02-16T13:37:58Z)
- Instruction-Tuning Open-Weight Language Models for BPMN Model Generation [0.0]
We investigate whether open-weight large language models, adapted via instruction tuning, can generate high-quality BPMN process models. We introduce InstruBPM, a reproducible approach that prepares paired text-diagram data and instruction-tunes an open-source large language model. We compare the tuned model with untuned open-weight baselines and strong proprietary models under consistent prompting regimes.
arXiv Detail & Related papers (2025-12-12T22:07:51Z)
- Logics-Parsing Technical Report [8.982345117231661]
We propose Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. We introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories.
arXiv Detail & Related papers (2025-09-24T04:54:37Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
We introduce the Arranged and Organized Extraction (AOE) Benchmark, designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggle significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation [48.462734327375536]
Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. We propose M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities.
arXiv Detail & Related papers (2025-06-02T04:00:35Z)
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs). We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)