A Hybrid Search for Complex Table Question Answering in Securities Report
- URL: http://arxiv.org/abs/2511.09179v1
- Date: Thu, 13 Nov 2025 01:37:47 GMT
- Title: A Hybrid Search for Complex Table Question Answering in Securities Report
- Authors: Daiki Shirafuji, Koji Tanaka, Tatsuhiko Saito
- Abstract summary: We propose a cell extraction method for Table Question Answering (TQA) without manual identification. Our approach estimates table headers by computing similarities between a given question and individual cells. We then select as the answer the cells at the intersection of the most relevant row and column.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, Large Language Models (LLMs) have been gaining increased attention in the domain of Table Question Answering (TQA), particularly for extracting information from tables in documents. However, directly entering entire tables as long text into LLMs often leads to incorrect answers, because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA that requires no manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cells at the intersection of the most relevant row and column. Furthermore, the language model is trained with contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach on the TQA dataset from the U4 shared task at NTCIR-18. The experimental results show that our pipeline achieves an accuracy of 74.6%, outperforming existing LLMs such as GPT-4o mini (63.9%). Although we used traditional encoder models for retrieval in this study, in future work we plan to incorporate more efficient text-search models to improve performance and narrow the gap with human evaluation results.
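The abstract's core idea can be illustrated with a minimal sketch (this is not the authors' code): each row and column header is scored by a weighted mix of a sparse TF-based similarity and a dense-encoder similarity, and the answer is read off at the intersection of the best row and column. The dense encoder is mocked here with token overlap for self-containment; in the paper it is a contrastively fine-tuned language model, and the headers, table values, and the mixing weight `alpha` below are all hypothetical.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Term-frequency vector over lowercase word tokens (toy tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_score(question, header, dense_sim, alpha=0.5):
    """Weighted mix of a sparse TF-based score and a dense-model score.
    `dense_sim` stands in for the contrastively trained encoder."""
    sparse = cosine(tf_vector(question), tf_vector(header))
    return alpha * sparse + (1 - alpha) * dense_sim(question, header)


def answer_cell(question, row_headers, col_headers, cells, dense_sim):
    """Return the cell at the intersection of the best row and column."""
    best_row = max(range(len(row_headers)),
                   key=lambda i: hybrid_score(question, row_headers[i], dense_sim))
    best_col = max(range(len(col_headers)),
                   key=lambda j: hybrid_score(question, col_headers[j], dense_sim))
    return cells[best_row][best_col]


def mock_dense(q, h):
    """Placeholder for a real encoder: Jaccard overlap of word tokens."""
    qs = set(re.findall(r"\w+", q.lower()))
    hs = set(re.findall(r"\w+", h.lower()))
    return len(qs & hs) / len(qs | hs) if qs | hs else 0.0


# Toy securities-report-style table: rows = fiscal years, columns = items.
row_headers = ["fiscal year 2022", "fiscal year 2023"]
col_headers = ["net sales", "operating income"]
cells = [["100", "10"], ["120", "15"]]

print(answer_cell("What was net sales in fiscal year 2023?",
                  row_headers, col_headers, cells, mock_dense))  # → 120
```

In the paper, the dense component is fine-tuned with contrastive learning on question-header pairs, so matching headers score far higher than the simple overlap used here.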
Related papers
- CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval [1.483000637348699]
We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. Results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval.
arXiv Detail & Related papers (2026-01-22T10:58:56Z) - CORE-T: COherent REtrieval of Tables for Text-to-SQL [91.76918495375384]
CORE-T is a scalable, training-free framework that enriches tables with purpose metadata and pre-computes a lightweight table-compatibility cache. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables.
arXiv Detail & Related papers (2026-01-19T14:51:23Z) - Agentic LLMs for Question Answering over Tabular Data [6.310433217813068]
Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of Table QA.
arXiv Detail & Related papers (2025-09-11T08:12:38Z) - Improving Table Retrieval with Question Generation from Partial Tables [2.2169618382995764]
We propose QGpT, a simple yet effective method that uses an LLM to generate synthetic questions based on small portions of a table. The generated questions are then jointly embedded with the partial table segments used for generation, enhancing semantic alignment with user queries.
arXiv Detail & Related papers (2025-08-08T09:35:56Z) - CRAFT: Training-Free Cascaded Retrieval for Tabular QA [11.984180880537936]
Table Question Answering (TQA) involves retrieving relevant tables from a large corpus to answer natural language queries. CRAFT is a cascaded retrieval approach that first uses a sparse retrieval model to filter a subset of candidate tables. CRAFT achieves better retrieval performance than state-of-the-art (SOTA) sparse, dense, and hybrid retrievers.
arXiv Detail & Related papers (2025-05-21T00:09:34Z) - RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking [63.253294691180635]
In real-world scenarios, beyond pure text, a substantial amount of knowledge is stored in tables. We first propose a table-corpora-aware RAG framework, named T-RAG, which consists of a hierarchical memory index, multi-stage retrieval, and graph-aware prompting.
arXiv Detail & Related papers (2025-04-02T04:24:41Z) - Leveraging Foundation Language Models (FLMs) for Automated Cohort Extraction from Large EHR Databases [50.552056536968166]
We propose and evaluate an algorithm for automating column matching on two large, popular and publicly-accessible EHR databases. Our approach achieves a high top-three accuracy of 92%, correctly matching 12 out of the 13 columns of interest, when using a small, pre-trained general purpose language model.
arXiv Detail & Related papers (2024-12-16T06:19:35Z) - Knowledge in Triples for LLMs: Enhancing Table QA Accuracy with Semantic Extraction [1.0968343822308813]
This paper proposes a novel approach that directly extracts triples from tabular data and integrates them with a retrieval-augmented generation (RAG) model to enhance the accuracy, coherence, and contextual richness of responses generated by a fine-tuned GPT-3.5-turbo-0125 model.
Our approach significantly outperforms existing baselines on the FeTaQA dataset, particularly excelling in Sacre-BLEU and ROUGE metrics.
arXiv Detail & Related papers (2024-09-21T16:46:15Z) - TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z) - KET-QA: A Dataset for Knowledge Enhanced Table Question Answering [63.56707527868466]
We propose to use a knowledge base (KB) as the external knowledge source for TableQA.
Every question requires the integration of information from both the table and the sub-graph to be answered.
We design a retriever-reasoner structured pipeline model to extract pertinent information from the vast knowledge sub-graph.
arXiv Detail & Related papers (2024-05-13T18:26:32Z) - Localize, Retrieve and Fuse: A Generalized Framework for Free-Form Question Answering over Tables [46.039687237878105]
TableQA aims at generating answers to questions grounded on a provided table.
Table-to-Graph conversion, cell localizing, external knowledge retrieval, and the fusion of table and text are proposed.
Experiments showcase the superior capabilities of TAG-QA in generating sentences that are both faithful and coherent.
arXiv Detail & Related papers (2023-09-20T03:52:34Z) - QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.