Related papers: Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts

Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts

URL: http://arxiv.org/abs/2508.19578v1
Date: Wed, 27 Aug 2025 05:23:22 GMT
Title: Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts
Authors: Jiaqi Deng, Yuho Lee, Nicole Hee-Yeon Kim, Hyangsuk Min, Taewon Yun, Minjeong Ban, Kim Yul, Hwanjun Song,
Abstract summary: HAMLET is a framework for evaluating the long-context comprehension of large language models.<n>It structures texts into a three-level key-fact hierarchy at root, branch, and leaf-levels.<n>It employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level.
Score: 19.640586886024952
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures source texts into a three-level key-fact hierarchy at root-, branch-, and leaf-levels, and employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, showing that our automatic evaluation achieves over 90% agreement with expert human judgments, while reducing the cost by up to 25 times. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects like the lost-in-the-middle. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and proprietary models, as well as across model scales. Our code and dataset are publicly available at https://github.com/DISL-Lab/HAMLET.

Related papers

Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
Arranged and Organized Extraction Benchmark designed to evaluate ability of large language models to comprehend fragmented documents.<n>AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries.<n>Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
textbfSCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models.<n>SCAN incorporates four key components:.<n>TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy;.<n>RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag;.<n>A PC$2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach achieves significantly higher accuracy compared to classic LLM-as-a-Judge method
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
Potential and Perils of Large Language Models as Judges of Unstructured Textual Data [0.631976908971572]
This research investigates the effectiveness of LLM-as-judge models to evaluate the thematic alignment of summaries generated by other LLMs.<n>Our findings reveal that while LLM-as-judge offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances.
arXiv Detail & Related papers (2025-01-14T14:49:14Z)
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
Recurrent Alignment with Hard Attention for Hierarchical Text Rating [6.858867989434858]
We propose a novel framework for hierarchical text rating utilizing large language models (LLMs) Our framework incorporates Recurrent Alignment with Hard Attention (RAHA) Experimental results demonstrate that RAHA outperforms existing state-of-the-art methods on three hierarchical text rating datasets.
arXiv Detail & Related papers (2024-02-14T00:40:51Z)
Large Language Models as Automated Aligners for benchmarking Vision-Language Models [48.4367174400306]
Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient curation, measuring the alignment betweenVLMs and human intelligence and value through automatic data curation and assessment.
arXiv Detail & Related papers (2023-11-24T16:12:05Z)
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models. Specifically, we construct a input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA) We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks. We use both Wikipedia, a corpus prevalently pre-trained by LLMs, along with continuously collected emerging corpora, to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.