StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
- URL: http://arxiv.org/abs/2505.20139v1
- Date: Mon, 26 May 2025 15:40:42 GMT
- Title: StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
- Authors: Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, Wenhu Chen
- Abstract summary: StructEval is a benchmark for evaluating Large Language Models' capabilities in producing structured output formats. It encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve only a 75.58 average score.
- Score: 39.108050455592036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve only a 75.58 average score, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
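The format-adherence metric described above can be pictured as a parse-level validity check: does the model's output even parse in the requested format? The sketch below is a minimal Python illustration of that idea, not StructEval's actual scorer; the function name and boolean pass/fail scoring are assumptions, and PyYAML is an assumed dependency.

```python
import csv
import io
import json

import yaml  # PyYAML, assumed as the YAML parser here


def format_adherence(output: str, fmt: str) -> bool:
    """Hypothetical parse-level check: does `output` parse as `fmt`?

    A sketch of the idea behind a format-adherence metric, not
    StructEval's actual implementation.
    """
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "yaml":
            yaml.safe_load(output)
        elif fmt == "csv":
            rows = [r for r in csv.reader(io.StringIO(output)) if r]
            # Require a consistent column count across non-empty rows.
            if len({len(r) for r in rows}) > 1:
                return False
        else:
            return False  # format not covered by this sketch
    except Exception:
        return False  # any parse error counts as non-adherent
    return True


print(format_adherence('{"name": "StructEval"}', "json"))  # True
print(format_adherence('{"name": unquoted}', "json"))      # False
print(format_adherence("a,b\n1,2,3", "csv"))               # False (ragged rows)
```

Structural correctness would go a step further, for example by comparing the parsed object against a reference, and renderable formats (HTML, React, SVG) would need render-level checks that this sketch does not attempt.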
Related papers
- The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats [0.0]
This study systematically evaluates Large Language Models' ability to convert unstructured text into structured formats. Experiments reveal that GPT-4o with few-shot prompting achieves breakthrough performance. These findings open new possibilities for automated structured data generation across various domains.
arXiv Detail & Related papers (2025-03-04T14:14:28Z)
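As a concrete picture of the few-shot setup such a study evaluates, the sketch below builds a prompt that pairs unstructured notes with their JSON renderings. The schema, example notes, and placeholder token are invented for illustration and are not taken from the paper.

```python
# Invented few-shot examples; the schema and notes are not from the paper.
FEW_SHOT_TEMPLATE = """\
Convert each note into JSON with keys "name", "date", and "amount".

Note: Paid Acme Corp $1,200 on 2024-03-03.
JSON: {"name": "Acme Corp", "date": "2024-03-03", "amount": 1200}

Note: Sent Globex $75 on 2024-01-09.
JSON: {"name": "Globex", "date": "2024-01-09", "amount": 75}

Note: <NOTE>
JSON:"""


def build_prompt(note: str) -> str:
    """Fill the few-shot template with a new unstructured note.

    str.replace is used instead of str.format so the literal JSON
    braces in the template need no escaping.
    """
    return FEW_SHOT_TEMPLATE.replace("<NOTE>", note)


print(build_prompt("Wired Initech $300 on 2024-05-20."))
```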
- Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
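The divide-and-conquer strategy can be illustrated as a prompting scaffold: divide a word into explicit single characters, have the model conquer the edit character by character, then combine the result. The helper below is an assumed reconstruction of that idea, not the paper's exact pipeline.

```python
def char_edit_prompt(word: str, instruction: str) -> str:
    """Assumed reconstruction of a divide-and-conquer scaffold:
    present the word as explicit characters (divide), ask for the
    edit per character (conquer), then ask to rejoin (combine).
    """
    chars = " ".join(word)  # "strawberry" -> "s t r a w b e r r y"
    return (
        f"The word '{word}' is spelled: {chars}.\n"
        f"Go through the characters one by one and {instruction}.\n"
        "Finally, join the remaining characters back into one word."
    )


print(char_edit_prompt("strawberry", "delete every 'r'"))
```

The point of the explicit character list is to keep a subword tokenizer from hiding character boundaries inside multi-character tokens.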
- StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
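A rule-based evaluator of the kind described can be as simple as deterministic string checks. The example below is hypothetical; the specific rule (bullet count and word limit) is invented and is not one of StructTest's actual tasks.

```python
def check_bullet_rule(output: str, n_bullets: int, max_words: int) -> bool:
    """Hypothetical deterministic rule: exactly `n_bullets` bullet
    lines, each containing at most `max_words` words. No judge model
    is needed, and new rules are just new functions.
    """
    bullets = [ln for ln in output.splitlines() if ln.lstrip().startswith("- ")]
    if len(bullets) != n_bullets:
        return False
    return all(len(b.lstrip("- ").split()) <= max_words for b in bullets)


answer = "- Collect data\n- Train the model\n- Evaluate the results"
print(check_bullet_rule(answer, n_bullets=3, max_words=4))  # True
print(check_bullet_rule(answer, n_bullets=2, max_words=4))  # False
```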
- Enhancing LLM's Cognition via Structurization [41.13997892843677]
Large language models (LLMs) process input contexts through a causal and sequential perspective.
This paper presents a novel concept of context structurization.
Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements.
arXiv Detail & Related papers (2024-07-23T12:33:58Z)
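One way to picture structurization is a small hierarchy that groups flat sentences under a scope and named aspects, then serializes back into prompt-ready text. The three-level schema below is an assumption for illustration; the paper's actual element types may differ.

```python
from dataclasses import dataclass, field


@dataclass
class StructurizedContext:
    """Assumed three-level layout: a scope, named aspects, and the
    sentences supporting each aspect. Illustrative only."""
    scope: str
    aspects: dict[str, list[str]] = field(default_factory=dict)

    def render(self) -> str:
        """Serialize the hierarchy back into prompt-ready text."""
        lines = [f"Scope: {self.scope}"]
        for aspect, sentences in self.aspects.items():
            lines.append(f"- {aspect}:")
            lines.extend(f"  * {s}" for s in sentences)
        return "\n".join(lines)


ctx = StructurizedContext(
    scope="Benchmarking structured output generation",
    aspects={
        "Coverage": ["18 formats and 44 task types are evaluated."],
        "Findings": ["Generation is harder than conversion."],
    },
)
print(ctx.render())
```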
- StrucText-Eval: Evaluating Large Language Model's Reasoning Ability in Structure-Rich Text [29.03935605732864]
We introduce StrucText-Eval, a benchmark to evaluate how well large language models understand and reason through structured text.
We show that while open-source LLMs achieve a maximum accuracy of 74.9% on the standard dataset, their performance drops significantly to 45.8% on the harder dataset.
In contrast, human participants reach an accuracy of 92.6% on StrucText-Eval-Hard, highlighting LLMs' current limitations in handling intricate structural information.
arXiv Detail & Related papers (2024-06-15T12:48:00Z)
- StructLM: Towards Building Generalist Models for Structured Knowledge Grounding [49.10029030628653]
Large language models' (LLMs) ability to process structured data lags behind state-of-the-art (SoTA) models by an average of 35%.
We train a series of models, referred to as StructLM, based on the Mistral and CodeLlama model families, ranging from 7B to 34B parameters.
Our StructLM series surpasses task-specific models on 16 out of 18 evaluated datasets and establishes new SoTA performance on 8 SKG tasks.
arXiv Detail & Related papers (2024-02-26T15:47:01Z)
- A Simple but Effective Approach to Improve Structured Language Model Output for Information Extraction [11.165093163378152]
Large language models (LLMs) have demonstrated impressive abilities in generating unstructured natural language according to instructions.
This paper introduces an efficient method, G&O, to enhance their structured text generation capabilities.
arXiv Detail & Related papers (2024-02-20T20:42:02Z) - Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)
- Unified Text Structuralization with Instruction-tuned Language Models [28.869098023025753]
We propose a simple and efficient approach to instruct large language models (LLMs) to extract a variety of structures from text.
Experiments show that this approach enables language models to perform comparably with other state-of-the-art methods on datasets spanning a variety of languages and knowledge domains.
arXiv Detail & Related papers (2023-03-27T07:39:05Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves new state-of-the-art results on all the structured prediction tasks we evaluated.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
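To make "structures as action sequences" concrete, the sketch below linearizes NER spans into open/generate/close actions that a language model could emit left to right. The action inventory is assumed for illustration and is not the paper's exact formulation.

```python
def spans_to_actions(tokens: list[str],
                     spans: list[tuple[int, int, str]]) -> list[str]:
    """Linearize labeled spans into an action sequence a language model
    could emit autoregressively. The OPEN/GEN/CLOSE inventory is an
    assumption for illustration.
    """
    actions = []
    for i, tok in enumerate(tokens):
        for start, _end, label in spans:
            if i == start:
                actions.append(f"OPEN-{label}")
        actions.append(f"GEN({tok})")
        for _start, end, _label in spans:
            if i == end:
                actions.append("CLOSE")
    return actions


print(spans_to_actions(["Alice", "visited", "Paris"],
                       [(0, 0, "PER"), (2, 2, "LOC")]))
# ['OPEN-PER', 'GEN(Alice)', 'CLOSE', 'GEN(visited)',
#  'OPEN-LOC', 'GEN(Paris)', 'CLOSE']
```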
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
The first method, TP-Transformer, augments the traditional Transformer architecture with an additional component to represent structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
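A layer in the spirit of GroupBERT's modification can be sketched by pairing a depthwise-convolution branch (local interactions) with a self-attention branch (global interactions). The PyTorch module below is an illustrative sketch under assumed dimensions, ordering, and normalization, not the published architecture.

```python
import torch
import torch.nn as nn


class ConvComplementedLayer(nn.Module):
    """Illustrative layer pairing a depthwise-convolution branch (local
    interactions) with a self-attention branch (global interactions).
    Not the published GroupBERT design.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel: int = 7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Depthwise conv: groups=d_model mixes only along the sequence axis.
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel // 2, groups=d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x)  # global mixing via self-attention
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local mixing
        return self.norm2(x + c)


out = ConvComplementedLayer()(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```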