Are LLMs Ready for TOON? Benchmarking Structural Correctness-Sustainability Trade-offs in Novel Structured Output Formats
- URL: http://arxiv.org/abs/2601.12014v1
- Date: Sat, 17 Jan 2026 11:42:02 GMT
- Title: Are LLMs Ready for TOON? Benchmarking Structural Correctness-Sustainability Trade-offs in Novel Structured Output Formats
- Authors: Elio Masciari, Vincenzo Moscato, Enea Vincenzo Napolitano, Gian Marco Orlando, Marco Perillo, Diego Russo,
- Abstract summary: Large Language Models (LLMs) are increasingly required to generate structured, machine-readable outputs for downstream systems. We argue that structured output formats should be assessed not only in terms of correctness, but also with respect to their environmental efficiency. We introduce a sustainability-aware evaluation framework for structured generation that measures token usage, generation time, and estimated carbon emissions.
- Score: 5.0663621870807996
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) are increasingly required to generate structured, machine-readable outputs for downstream systems. While recent benchmarks have focused on evaluating the structural correctness of such outputs, the environmental impact of inference for different output formats has largely been overlooked. In this paper, we argue that structured output formats should be assessed not only in terms of correctness, but also with respect to their environmental efficiency. To this end, we introduce a sustainability-aware evaluation framework for structured generation that measures token usage, generation time, and estimated carbon emissions. Within this framework, we propose the Environment-Aware Generation Correctness Score (GCS_env), a unified metric that integrates structural correctness with carbon-aware efficiency. Using this framework, we systematically benchmark the novel TOON format against established representations (JSON, XML, YAML) across multiple LLMs spanning different architectures and parameter scales. Our results reveal a consistent trade-off: TOON yields markedly more compact outputs and lower emissions, but lower structural correctness when models lack native support. We show that increased model capacity reduces this gap and that environment-aware scoring can shift format rankings depending on deployment priorities. Our analysis highlights the need for sustainability-inclusive benchmarking and provides empirical evidence that compact representations such as TOON can offer practical advantages in large-scale, carbon-conscious LLM deployments.
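The abstract's two key ideas, TOON's tabular compactness relative to JSON and a score that blends correctness with emissions, can be illustrated with a short sketch. The TOON-style encoding and the `gcs_env` function below are hypothetical illustrations based only on the abstract; the paper's actual TOON grammar and GCS_env formula may differ.

```python
import json

# Hypothetical records used to compare serialization compactness; the paper's
# actual benchmark data is not reproduced here.
records = [
    {"id": 1, "name": "alice", "score": 91},
    {"id": 2, "name": "bob", "score": 84},
    {"id": 3, "name": "carol", "score": 77},
]

json_text = json.dumps(records)

# A TOON-style tabular encoding: one header line declaring the array length
# and field names, then one comma-separated row per record. Field names are
# written once instead of being repeated per object (illustrative sketch,
# not the official TOON specification).
fields = ["id", "name", "score"]
toon_text = "records[3]{id,name,score}:\n" + "\n".join(
    ",".join(str(r[f]) for f in fields) for r in records
)

def gcs_env(correctness, emissions, baseline_emissions, alpha=0.5):
    """Hypothetical environment-aware score: a convex blend of structural
    correctness and relative emission savings versus a baseline format.
    The paper's exact GCS_env definition may differ."""
    savings = 1.0 - emissions / baseline_emissions
    return alpha * correctness + (1.0 - alpha) * savings

print(f"JSON chars: {len(json_text)}, TOON-style chars: {len(toon_text)}")
```

Because the per-record field names are factored out into the header, the TOON-style text is substantially shorter, which is the mechanism behind the lower token usage and emissions the abstract reports; the blended score then lets a format with slightly lower correctness still rank first when its emission savings are large.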
Related papers
- FMBench: Adaptive Large Language Model Output Formatting [49.52930069696333]
We present FMBench, a benchmark for adaptive Markdown output formatting. Experiments on two model families show that SFT consistently improves semantic alignment. Results also reveal an inherent trade-off between semantic and structural objectives.
arXiv Detail & Related papers (2026-02-06T04:42:06Z)
- RL-Struct: A Lightweight Reinforcement Learning Framework for Reliable Structured Output in LLMs [0.08594140167290097]
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language generation and reasoning. Their integration into automated software ecosystems is often hindered by the "Structure Gap". We propose a lightweight, efficient Reinforcement Learning framework to bridge this gap.
arXiv Detail & Related papers (2025-11-29T04:47:14Z)
- STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability [11.095198847819573]
Large Language Models (LLMs) are increasingly deployed for structured data generation. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs.
arXiv Detail & Related papers (2025-11-27T02:49:52Z)
- Effects of structure on reasoning in instance-level Self-Discover [0.0]
This paper introduces iSelf-Discover, an instance-level adaptation of the Self-Discover framework, and uses it to compare dynamically generated structured reasoning with its unstructured counterpart. Our empirical evaluation across diverse benchmarks using state-of-the-art open-source models supports a consistent advantage for unstructured reasoning.
arXiv Detail & Related papers (2025-07-04T07:28:42Z)
- Elucidating the Design Space of Multimodal Protein Language Models [69.3650883370033]
Multimodal protein language models (PLMs) integrate sequence and token-based structural information. This paper systematically elucidates the design space of multimodal PLMs to overcome their limitations. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling.
arXiv Detail & Related papers (2025-04-15T17:59:43Z)
- Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization [49.88778604259453]
We evaluate Kahneman-Tversky Optimization (KTO) as a fine-tuning method for large language models (LLMs) in federated learning (FL) settings. In both its original (KTOO) and redistributed (KTOR) configurations, KTO consistently outperforms DPO across all benchmarks. These findings establish KTO as a robust and scalable fine-tuning method for FL, motivating its adoption for privacy-preserving, decentralized, and heterogeneous environments.
arXiv Detail & Related papers (2025-02-20T01:44:21Z)
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs [0.0]
This paper analyzes the performance of Small Language Models (SLMs) and Vision Language Models (VLMs). To quantify the trade-off between model performance and carbon emissions, we introduce a novel metric called CEGI (Carbon Efficient Gain Index). Our findings suggest that the marginal gains in accuracy from larger models do not justify the substantial increase in carbon emissions.
arXiv Detail & Related papers (2024-12-03T17:32:47Z)
- Adaptable Embeddings Network (AEN) [49.1574468325115]
We introduce Adaptable Embeddings Networks (AEN), a novel dual-encoder architecture using Kernel Density Estimation (KDE).
AEN allows for runtime adaptation of classification criteria without retraining and is non-autoregressive.
The architecture's ability to preprocess and cache condition embeddings makes it ideal for edge computing applications and real-time monitoring systems.
arXiv Detail & Related papers (2024-11-21T02:15:52Z)
- Structural Bias for Aspect Sentiment Triplet Extraction [15.273669042985883]
Structural bias has been exploited for aspect sentiment triplet extraction (ASTE) and led to improved performance.
It is recognized that explicitly incorporating structural bias would have a negative impact on efficiency, whereas pretrained language models (PLMs) can already capture implicit structures.
We propose to address the efficiency issues by using an adapter to integrate structural bias in the PLM and using a cheap-to-compute relative position structure.
arXiv Detail & Related papers (2022-09-02T05:02:18Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.