QTSumm: Query-Focused Summarization over Tabular Data
- URL: http://arxiv.org/abs/2305.14303v2
- Date: Tue, 7 Nov 2023 04:53:07 GMT
- Title: QTSumm: Query-Focused Summarization over Tabular Data
- Authors: Yilun Zhao, Zhenting Qi, Linyong Nan, Boyu Mi, Yixin Liu, Weijin Zou,
Simeng Han, Ruizhe Chen, Xiangru Tang, Yumo Xu, Dragomir Radev, Arman Cohan
- Abstract summary: People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
- Score: 58.62152746690958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People primarily consult tables to conduct data analysis or answer specific
questions. Text generation systems that can provide accurate table summaries
tailored to users' information needs can facilitate more efficient access to
relevant data insights. Motivated by this, we define a new query-focused table
summarization task, where text generation models have to perform human-like
reasoning and analysis over the given table to generate a tailored summary. We
introduce a new benchmark named QTSumm for this task, which contains 7,111
human-annotated query-summary pairs over 2,934 tables covering diverse topics.
We investigate a set of strong baselines on QTSumm, including text generation,
table-to-text generation, and large language models. Experimental results and
manual analysis reveal that the new task presents significant challenges in
table-to-text generation for future research. Moreover, we propose a new
approach named ReFactor, to retrieve and reason over query-relevant information
from tabular data to generate several natural language facts. Experimental
results demonstrate that ReFactor can bring improvements to baselines by
concatenating the generated facts to the model input. Our data and code are
publicly available at https://github.com/yale-nlp/QTSumm.
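The abstract describes ReFactor as generating query-relevant natural language facts and concatenating them to the model input. A minimal sketch of that concatenation idea is below; all names and the input format are illustrative assumptions, not the actual ReFactor implementation.

```python
# Minimal sketch of the fact-concatenation idea: retrieved natural
# language facts are prepended to the query and serialized table
# before the whole string is fed to a text generation model.
# Function and field names here are hypothetical.

def build_input(query: str, table_text: str, facts: list[str]) -> str:
    """Concatenate query-relevant facts with the original model input."""
    fact_block = " ".join(f"Fact: {f}" for f in facts)
    return f"{fact_block} Query: {query} Table: {table_text}"

model_input = build_input(
    "Which season had the highest attendance?",
    "season | attendance ; 2019 | 54,000 ; 2020 | 12,500",
    ["Attendance in 2019 was 54,000, the highest of any listed season."],
)
```

The design choice is deliberately simple: because the facts are plain text, they can be prepended to the input of any text generation baseline without changing the model itself.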
Related papers
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- TANQ: An open domain dataset of table answered questions [15.323690523538572]
TANQ is the first open domain question answering dataset where the answers require building tables from information across multiple sources.
We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups.
Our best-performing baseline, GPT-4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points.
arXiv Detail & Related papers (2024-05-13T14:07:20Z)
- QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs [63.98556480088152]
Table summarization is a crucial task aimed at condensing information into concise and comprehensible textual summaries.
We propose a novel method to address these limitations by introducing query-focused multi-table summarization.
Our approach, which comprises a table serialization module, a summarization controller, and a large language model, generates query-dependent table summaries tailored to users' information needs.
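A table serialization module of the kind mentioned above typically flattens a table into a textual form an LLM can consume. The sketch below shows one generic way to do this; it is an assumption for illustration, not QFMTS's actual serialization module.

```python
# Illustrative table serialization: flatten headers and rows into a
# pipe-delimited, Markdown-style string that can be placed directly
# into an LLM prompt. This is a generic sketch, not the QFMTS module.

def serialize_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render a table as one line per row, columns joined by ' | '."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)

serialized = serialize_table(
    ["Team", "Wins"],
    [["Leeds", "12"], ["York", "9"]],
)
```

Serializations like this trade structure for simplicity: the model sees column alignment only through the delimiter pattern, which is usually sufficient for short tables.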
arXiv Detail & Related papers (2024-05-08T15:05:55Z)
- Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [36.915250638481986]
We introduce LiveSum, a new benchmark dataset for generating summary tables of competitions based on real-time commentary texts.
We evaluate the performances of state-of-the-art Large Language Models on this task in both fine-tuning and zero-shot settings.
We additionally propose a novel pipeline called $T^3$ (Text-Tuple-Table) to improve their performances.
arXiv Detail & Related papers (2024-04-22T14:31:28Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- ReTAG: Reasoning Aware Table to Analytic Text Generation [12.603569641254417]
ReTAG is a table and reasoning aware model that uses vector-quantization to infuse different types of analytical reasoning into the output.
We extend the ToTTo and InfoTabs datasets (open-sourcing 35.6K analytical and 55.9K descriptive instances) with the reasoning categories used in each reference sentence.
arXiv Detail & Related papers (2023-05-19T17:03:09Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- FeTaQA: Free-form Table Question Answering [33.018256483762386]
We introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs.
FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source.
arXiv Detail & Related papers (2021-04-01T09:59:40Z)
- Summarizing and Exploring Tabular Data in Conversational Search [36.14882974814593]
We build a new conversation-oriented, open-domain table summarization dataset.
It includes annotated table summaries, which not only answer questions but also help people explore other information in the table.
We utilize this dataset to develop automatic table summarization systems as SOTA baselines.
arXiv Detail & Related papers (2020-05-23T08:29:51Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.