Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol
- URL: http://arxiv.org/abs/2504.10284v2
- Date: Tue, 22 Apr 2025 03:09:56 GMT
- Title: Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol
- Authors: Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi,
- Abstract summary: Literature review tables are essential for summarizing and comparing collections of scientific papers.<n>We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.<n>Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
- Score: 83.90769864167301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at https://github.com/JHU-CLSP/arXiv2Table.
Related papers
- LitLLMs, LLMs for Literature Review: Are we there yet? [15.785989492351684]
This paper explores the zero-shot abilities of recent Large Language Models in assisting with the writing of literature reviews based on an abstract.<n>For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper.<n>In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review.
arXiv Detail & Related papers (2024-12-15T01:12:26Z) - ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models [58.34560740973768]
We introduce a framework that leverages language models (LMs) to generate literature review tables.
A new dataset of 2,228 literature review tables extracted from ArXiv papers synthesize a total of 7,542 research papers.
We evaluate LMs' abilities to reconstruct reference tables, finding this task benefits from additional context.
arXiv Detail & Related papers (2024-10-25T18:31:50Z) - Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments demonstrate on two datasets from different domains, that LLMs fine-tuned with the auxiliary task generate higher quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs [63.98556480088152]
Table summarization is a crucial task aimed at condensing information into concise and comprehensible textual summaries.
We propose a novel method to address these limitations by introducing query-focused multi-table summarization.
Our approach, which comprises a table serialization module, a summarization controller, and a large language model, generates query-dependent table summaries tailored to users' information needs.
arXiv Detail & Related papers (2024-05-08T15:05:55Z) - Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [36.915250638481986]
We introduce LiveSum, a new benchmark dataset for generating summary tables of competitions based on real-time commentary texts.<n>We evaluate the performances of state-of-the-art Large Language Models on this task in both fine-tuning and zero-shot settings.<n>We additionally propose a novel pipeline called $T3$(Text-Tuple-Table) to improve their performances.
arXiv Detail & Related papers (2024-04-22T14:31:28Z) - STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Knowledge Bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z) - Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task.
We report well-established metrics like ROUGE, BERTscore, and SEM-F1$ on two different datasets of alternative narratives.
arXiv Detail & Related papers (2024-02-26T20:33:50Z) - All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.