On the Robustness of Language Models for Tabular Question Answering
- URL: http://arxiv.org/abs/2406.12719v1
- Date: Tue, 18 Jun 2024 15:41:15 GMT
- Title: On the Robustness of Language Models for Tabular Question Answering
- Authors: Kushal Raj Bhandari, Sixue Xing, Soham Dan, Jianxi Gao
- Abstract summary: Large Language Models (LLMs) have been shown to tackle table comprehension tasks without specific training.
We evaluate the robustness of LLMs on Wikipedia-based $\textbf{WTQ}$ and financial report-based $\textbf{TAT-QA}$ TQA datasets.
- Score: 7.486549276995143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs), originally shown to ace various text comprehension tasks, have also remarkably been shown to tackle table comprehension tasks without specific training. While previous research has explored LLM capabilities with tabular dataset tasks, our study assesses the influence of $\textit{in-context learning}$, $\textit{model scale}$, $\textit{instruction tuning}$, and $\textit{domain biases}$ on Tabular Question Answering (TQA). We evaluate the robustness of LLMs on Wikipedia-based $\textbf{WTQ}$ and financial report-based $\textbf{TAT-QA}$ TQA datasets, focusing on their ability to robustly interpret tabular data under various augmentations and perturbations. Our findings indicate that instructions significantly enhance performance, with recent models like Llama3 exhibiting greater robustness than earlier versions. However, data contamination and practical reliability issues persist, especially with WTQ. We highlight the need for improved methodologies, including structure-aware self-attention mechanisms and better handling of domain-specific tabular data, to develop more reliable LLMs for table comprehension.
Related papers
- Enhancing Temporal Understanding in LLMs for Semi-structured Tables [50.59009084277447]
We conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of large language models (LLMs).
Our investigation leads to enhancements in TempTabQA, a dataset specifically designed for temporal question answering.
We introduce a novel approach, C.L.E.A.R., to strengthen LLM capabilities in this domain.
arXiv Detail & Related papers (2024-07-22T20:13:10Z)
- Uncovering Limitations of Large Language Models in Information Seeking from Tables [28.19697259795014]
This paper introduces a more reliable benchmark for Table Information Seeking (TabIS).
To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format.
arXiv Detail & Related papers (2024-06-06T14:30:59Z)
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
To better evaluate this setting and facilitate modeling efforts, we introduce TACT - Text And Calculations through Tables.
TACT contains challenging instructions that demand stitching information scattered across one or more texts, and performing complex integration on this information to generate the answer.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [17.910306140400046]
This research endeavors to apply Large Language Models (LLMs) towards addressing these predictive tasks.
Our research aims to mitigate this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2.
arXiv Detail & Related papers (2024-03-29T14:41:21Z)
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering [53.56653281752486]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets.
We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps.
We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [58.11442663694328]
We propose TAP4LLM as a versatile pre-processing toolbox to generate table prompts.
In each module, we collect and design several common methods for usage in various scenarios.
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
- HeLM: Highlighted Evidence augmented Language Model for Enhanced Table-to-Text Generation [7.69801337810352]
We conduct parameter-efficient fine-tuning on the LLaMA2 model.
Our approach involves injecting reasoning information into the input by emphasizing table-specific row data.
On both the FetaQA and QTSumm datasets, our approach achieved state-of-the-art results.
arXiv Detail & Related papers (2023-11-15T12:02:52Z)
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study [44.39031420687302]
Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks.
We try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs.
We propose $\textit{self-augmentation}$ for effective structural prompting, such as critical value / range identification.
arXiv Detail & Related papers (2023-05-22T14:23:46Z)
- TABLET: Learning From Instructions For Tabular Data [46.62140500101618]
We introduce TABLET, a benchmark of 20 diverse datasets annotated with instructions that vary in their phrasing, granularity, and technicality.
We find in-context instructions increase zero-shot F1 performance for Flan-T5 11b by 44% on average and 13% for ChatGPT on TABLET.
arXiv Detail & Related papers (2023-04-25T23:07:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.