StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
- URL: http://arxiv.org/abs/2412.18011v1
- Date: Mon, 23 Dec 2024 22:08:40 GMT
- Title: StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
- Authors: Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou,
- Abstract summary: StructTest is a novel benchmark that evaluates large language models on their ability to produce structured outputs.
We demonstrate that StructTest serves as a good proxy for general reasoning abilities.
- Score: 78.84060166851805
- License:
- Abstract: The rapid development of large language models (LLMs) necessitates robust, unbiased, and scalable methods for evaluating their capabilities. However, human annotations are expensive to scale, model-based evaluations are prone to biases in answer style, while target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to produce compositionally specified structured outputs as an unbiased, cheap-to-run and difficult-to-cheat measure. The evaluation is done deterministically by a rule-based evaluator, which can be easily extended to new tasks. By testing structured outputs across diverse task domains -- including Summarization, Code, HTML and Math -- we demonstrate that StructTest serves as a good proxy for general reasoning abilities, as producing structured outputs often requires internal logical reasoning. We believe that StructTest offers a critical, complementary approach to objective and robust model evaluation.
Related papers
- StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following [13.077503628759446]
Multi-turn instruction following capability constitutes a core competency of large language models.
We propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling.
arXiv Detail & Related papers (2025-02-20T12:22:18Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications.
Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation [24.081573908824353]
First-order logic (FOL) reasoning is pivotal for intelligent systems.
Existing benchmarks often rely on extensive human annotation or handcrafted templates.
We propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models with the rigor and precision of symbolic provers.
arXiv Detail & Related papers (2025-02-10T15:31:54Z) - The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability.
Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum.
We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z) - StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation [46.59416831869014]
We propose a novel evaluation framework referred to as StructEval.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts.
Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination.
arXiv Detail & Related papers (2024-08-06T16:28:30Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical
Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language models evaluators with human preference.
HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - DecompEval: Evaluating Generated Texts as Unsupervised Decomposed
Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.