On the Robustness of Language Models for Tabular Question Answering
- URL: http://arxiv.org/abs/2406.12719v1
- Date: Tue, 18 Jun 2024 15:41:15 GMT
- Title: On the Robustness of Language Models for Tabular Question Answering
- Authors: Kushal Raj Bhandari, Sixue Xing, Soham Dan, Jianxi Gao
- Abstract summary: Large Language Models (LLMs) have been shown to tackle table comprehension tasks without specific training.
We evaluate the robustness of LLMs on Wikipedia-based $\textbf{WTQ}$ and financial report-based $\textbf{TAT-QA}$ TQA datasets.
- Score: 7.486549276995143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs), originally shown to ace various text comprehension tasks, have also remarkably been shown to tackle table comprehension tasks without specific training. While previous research has explored LLM capabilities with tabular dataset tasks, our study assesses the influence of $\textit{in-context learning}$, $\textit{model scale}$, $\textit{instruction tuning}$, and $\textit{domain biases}$ on Tabular Question Answering (TQA). We evaluate the robustness of LLMs on the Wikipedia-based $\textbf{WTQ}$ and financial report-based $\textbf{TAT-QA}$ TQA datasets, focusing on their ability to robustly interpret tabular data under various augmentations and perturbations. Our findings indicate that instructions significantly enhance performance, with recent models like Llama3 exhibiting greater robustness over earlier versions. However, data contamination and practical reliability issues persist, especially with WTQ. We highlight the need for improved methodologies, including structure-aware self-attention mechanisms and better handling of domain-specific tabular data, to develop more reliable LLMs for table comprehension.
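As a rough illustration of the kind of perturbation-based robustness probe the abstract refers to, the sketch below builds a TQA prompt from a serialized table and compares it against a row-shuffled variant. The serialization format, the shuffle perturbation, and all names here are illustrative assumptions, not the paper's exact protocol.

```python
import random

def serialize_table(header, rows):
    """Serialize a table as pipe-delimited text for an LLM prompt."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(c) for c in row) for row in rows]
    return "\n".join(lines)

def shuffle_rows(rows, seed=0):
    """Row-order perturbation: a robust model's answer should not change."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled

def build_prompt(header, rows, question, instruction=""):
    """Compose an (optionally instructed) TQA prompt from a table and a question."""
    return f"{instruction}\nTable:\n{serialize_table(header, rows)}\nQuestion: {question}\nAnswer:"

# Hypothetical usage: compare model answers on the original vs. perturbed table.
header = ["Year", "Champion"]
rows = [["2019", "Alpha"], ["2020", "Beta"], ["2021", "Gamma"]]
original = build_prompt(header, rows, "Who won in 2020?", "Answer using only the table.")
perturbed = build_prompt(header, shuffle_rows(rows), "Who won in 2020?", "Answer using only the table.")
```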
Related papers
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding.
TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs.
Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
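A minimal sketch of the retrieve-then-prompt idea described above, assuming a naive keyword-overlap retriever; TableRAG's actual query expansion and learned schema/cell retrieval are not reproduced here.

```python
def expand_query(question):
    """Naive query expansion: the full question plus its individual keywords."""
    words = [w.strip("?.,").lower() for w in question.split()]
    return [question.lower()] + words

def retrieve_columns(columns, queries, k=3):
    """Schema retrieval: rank column names by keyword overlap with the queries."""
    ranked = sorted(columns, key=lambda c: sum(q in c.lower() for q in queries), reverse=True)
    return ranked[:k]

def retrieve_cells(rows, queries, k=5):
    """Cell retrieval: return up to k cells whose text mentions a query term."""
    hits = [str(cell) for row in rows for cell in row
            if any(q in str(cell).lower() for q in queries)]
    return hits[:k]

# Hypothetical usage: only the retrieved schema and cells go into the prompt.
queries = expand_query("Which team won in 2020?")
columns = retrieve_columns(["Year", "Team", "Coach", "Stadium"], queries)
cells = retrieve_cells([["2019", "Alpha"], ["2020", "Beta"]], queries)
```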
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
- Knowledge in Triples for LLMs: Enhancing Table QA Accuracy with Semantic Extraction [1.0968343822308813]
This paper proposes a novel approach that extracts triples directly from tabular data and integrates them with a retrieval-augmented generation (RAG) model to enhance the accuracy, coherence, and contextual richness of responses generated by a fine-tuned GPT-3.5-turbo-0125 model.
Our approach significantly outperforms existing baselines on the FeTaQA dataset, particularly excelling in Sacre-BLEU and ROUGE metrics.
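A minimal sketch of the row-to-triple conversion suggested by this summary, assuming the first column names the entity; the paper's actual extraction and its integration with the RAG model are more involved.

```python
def table_to_triples(header, rows, subject_col=0):
    """Turn each table row into (subject, predicate, object) triples,
    treating one column as the subject and the remaining headers as predicates."""
    triples = []
    for row in rows:
        subject = row[subject_col]
        for i, value in enumerate(row):
            if i != subject_col:
                triples.append((subject, header[i], value))
    return triples

# Hypothetical usage:
header = ["Company", "Revenue", "Year"]
rows = [["Acme", "1.2B", "2023"]]
print(table_to_triples(header, rows))
# [('Acme', 'Revenue', '1.2B'), ('Acme', 'Year', '2023')]
```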
arXiv Detail & Related papers (2024-09-21T16:46:15Z)
- Enhancing Temporal Understanding in LLMs for Semi-structured Tables [50.59009084277447]
We conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of large language models (LLMs).
Our investigation leads to enhancements in TempTabQA, a dataset specifically designed for temporal question answering.
We introduce a novel approach, C.L.E.A.R., to strengthen LLM capabilities in this domain.
arXiv Detail & Related papers (2024-07-22T20:13:10Z)
- Uncovering Limitations of Large Language Models in Information Seeking from Tables [28.19697259795014]
This paper introduces a more reliable benchmark for Table Information Seeking (TabIS).
To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format.
arXiv Detail & Related papers (2024-06-06T14:30:59Z)
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering [53.56653281752486]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets.
We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps.
We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
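A minimal sketch of two of the components listed above (query-aware row sampling and markdown packing/serialization), under simplified assumptions; TAP4LLM's augmentation stage and its actual sampling heuristics are not reproduced here.

```python
def sample_rows(rows, query_terms, max_rows=5):
    """Query-aware sampling: keep rows that mention any query term, capped at max_rows."""
    keep = [r for r in rows
            if any(t.lower() in " ".join(map(str, r)).lower() for t in query_terms)]
    return keep[:max_rows] or rows[:max_rows]

def to_markdown(header, rows):
    """Pack a (sub-)table into a markdown serialization for the LLM prompt."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
    return "\n".join(lines)

# Hypothetical usage: decompose a large table, then serialize the sub-table.
header = ["Quarter", "Revenue", "Region"]
rows = [["Q1", "10M", "EU"], ["Q2", "12M", "US"], ["Q3", "9M", "EU"]]
sub_table = sample_rows(rows, ["EU"])
print(to_markdown(header, sub_table))
```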
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
- HeLM: Highlighted Evidence augmented Language Model for Enhanced Table-to-Text Generation [7.69801337810352]
We conduct parameter-efficient fine-tuning on the LLaMA2 model.
Our approach involves injecting reasoning information into the input by emphasizing table-specific row data.
On both the FeTaQA and QTSumm datasets, our approach achieved state-of-the-art results.
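A minimal sketch of the row-highlighting idea, assuming the evidence row indices are already known; the function name and the [highlight] marker are illustrative, not HeLM's actual input format.

```python
def highlight_rows(header, rows, evidence_idx):
    """Emphasize evidence rows by tagging them in the serialized table input."""
    lines = [" | ".join(header)]
    for i, row in enumerate(rows):
        text = " | ".join(map(str, row))
        lines.append(f"[highlight] {text}" if i in evidence_idx else text)
    return "\n".join(lines)

# Hypothetical usage: mark row 1 as the evidence row before prompting the model.
print(highlight_rows(["Year", "Sales"], [["2019", "8M"], ["2020", "10M"]], {1}))
```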
arXiv Detail & Related papers (2023-11-15T12:02:52Z)
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study [44.39031420687302]
Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks.
We try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs.
We propose $\textit{self-augmentation}$ for effective structural prompting, such as critical value / range identification.
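A minimal sketch of a two-pass self-augmentation prompt, where the model is first asked to identify critical values/ranges and its answer is folded back into the final question prompt; `ask_model` is a placeholder for any LLM call, and the prompt wording is an assumption rather than the paper's template.

```python
def self_augment_prompt(table_text, question, ask_model):
    """Two-pass structural prompting: probe the table structure first,
    then answer the question with the model's own structural notes appended."""
    probe = f"Table:\n{table_text}\nList the critical values and ranges in this table."
    structure_notes = ask_model(probe)
    final = (f"Table:\n{table_text}\nStructural notes: {structure_notes}\n"
             f"Question: {question}\nAnswer:")
    return ask_model(final)

# Hypothetical usage with a stubbed model call:
answer = self_augment_prompt("Year | Sales\n2020 | 10M", "What were the 2020 sales?",
                             ask_model=lambda prompt: "(model output)")
```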
arXiv Detail & Related papers (2023-05-22T14:23:46Z)
- TABLET: Learning From Instructions For Tabular Data [46.62140500101618]
We introduce TABLET, a benchmark of 20 diverse datasets annotated with instructions that vary in their phrasing, granularity, and technicality.
We find in-context instructions increase zero-shot F1 performance for Flan-T5 11b by 44% on average and 13% for ChatGPT on TABLET.
arXiv Detail & Related papers (2023-04-25T23:07:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.