FMBench: Adaptive Large Language Model Output Formatting
- URL: http://arxiv.org/abs/2602.06384v1
- Date: Fri, 06 Feb 2026 04:42:06 GMT
- Title: FMBench: Adaptive Large Language Model Output Formatting
- Authors: Yaoting Wang, Yun Zhou, Henghui Ding
- Abstract summary: We present FMBench, a benchmark for adaptive Markdown output formatting. Experiments on two model families show that SFT consistently improves semantic alignment. Results also reveal an inherent trade-off between semantic and structural objectives.
- Score: 49.52930069696333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.
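The composite objective the abstract describes (semantic fidelity balanced against structural correctness) can be illustrated with a minimal sketch. Everything below is a hypothetical assumption for illustration, not FMBench's actual reward design: the function names, penalty weights, and the choice of checks are invented, targeting three of the error classes the abstract mentions (broken code blocks, inconsistent headings, malformed tables).

```python
import re

def markdown_structure_score(text: str) -> float:
    """Toy structural-correctness score for Markdown output.
    Hypothetical sketch, not FMBench's reward: penalizes three
    error classes named in the abstract."""
    score = 1.0

    # Code fences must come in pairs; an odd count means a broken block.
    if text.count("```") % 2 != 0:
        score -= 0.4

    # Heading levels should not skip (e.g. jumping from # to ###).
    levels = [len(m.group(1)) for m in re.finditer(r"^(#{1,6})\s", text, re.M)]
    if any(b - a > 1 for a, b in zip(levels, levels[1:])):
        score -= 0.3

    # Table rows should all have the same column count.
    rows = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    col_counts = {ln.strip().strip("|").count("|") for ln in rows}
    if len(col_counts) > 1:
        score -= 0.3

    return max(score, 0.0)

def composite_reward(semantic: float, text: str, alpha: float = 0.5) -> float:
    """Blend a semantic-fidelity score with the structural score,
    mirroring the semantic/structural trade-off the abstract notes."""
    return alpha * semantic + (1 - alpha) * markdown_structure_score(text)
```

A higher `alpha` favors semantic alignment at the expense of format compliance, which is one concrete way the trade-off highlighted in the results could surface during reward design.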
Related papers
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z) - FocalOrder: Focal Preference Optimization for Reading Order Detection [23.497081928689525]
We propose FocalOrder, a framework driven by Focal Preference Optimization (FPO). FocalOrder employs adaptive difficulty discovery with an exponential moving average mechanism to dynamically pinpoint hard-to-learn transitions. Experiments demonstrate that FocalOrder establishes new state-of-the-art results on OmniDocBench v1.0 and Comp-HRDoc.
arXiv Detail & Related papers (2026-01-12T12:37:04Z) - AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs [46.52320938421707]
Inference-time ensembling provides a practical way to combine large language model capabilities without retraining. We propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%.
arXiv Detail & Related papers (2026-01-09T18:58:22Z) - RL-Struct: A Lightweight Reinforcement Learning Framework for Reliable Structured Output in LLMs [0.08594140167290097]
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language generation and reasoning. Their integration into automated software ecosystems is often hindered by the "Structure Gap". We propose a lightweight, efficient Reinforcement Learning framework to bridge this gap.
arXiv Detail & Related papers (2025-11-29T04:47:14Z) - Data Dependency-Aware Code Generation from Enhanced UML Sequence Diagrams [54.528185120850274]
We propose a novel step-by-step code generation framework named API2Dep. First, we introduce an enhanced Unified Modeling Language (UML) API diagram tailored for service-oriented architectures. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference task.
arXiv Detail & Related papers (2025-08-05T12:28:23Z) - From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format [0.5339846068056558]
Existing cybersecurity playbooks are often written in heterogeneous, non-machine-readable formats. This paper explores the suitability of Large Language Models, combined with Prompt Engineering, to automatically translate legacy incident response playbooks into the standardized, machine-readable CACAO format.
arXiv Detail & Related papers (2025-08-05T11:43:54Z) - The Price of Format: Diversity Collapse in LLMs [32.616432249190716]
Large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that diversity collapse persists even under high-temperature sampling. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity.
arXiv Detail & Related papers (2025-05-25T02:52:35Z) - Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding [89.52931576290976]
We present conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead.
arXiv Detail & Related papers (2025-01-01T03:23:00Z) - HySem: A context length optimized LLM pipeline for unstructured tabular extraction [0.0]
We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic representations from HTML tables.
Running on commodity hardware, HySem surpasses its peer open-source models in accuracy and provides competitive performance when benchmarked against OpenAI GPT-4o.
arXiv Detail & Related papers (2024-08-18T10:40:37Z) - Contrastive Instruction Tuning [61.97704869248903]
We propose Contrastive Instruction Tuning to maximize the similarity between semantically equivalent instruction-instance pairs.
Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
arXiv Detail & Related papers (2024-02-17T00:09:32Z) - Learning Label Modular Prompts for Text Classification in the Wild [56.66187728534808]
We propose text classification in-the-wild, which introduces different non-stationary training/testing stages.
Decomposing a complex task into modular components can enable robust generalisation under such non-stationary environment.
We propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks.
arXiv Detail & Related papers (2022-11-30T16:26:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.