Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables
- URL: http://arxiv.org/abs/2403.04577v2
- Date: Fri, 02 May 2025 17:52:05 GMT
- Title: Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables
- Authors: Aneta Koleva, Martin Ringsquandl, Ahmed Hatem, Thomas Runkler, Volker Tresp,
- Abstract summary: A new dataset, Wiki-TabNER, is proposed to enrich the existing benchmark datasets.<n>This paper describes the distinguishing features of the Wiki-TabNER dataset and the labeling process.<n>In addition, we propose a prompting framework for evaluating the new large language models on the within tables NER task.
- Score: 18.330753799139845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interest in solving table interpretation tasks has grown over the years, yet it still relies on existing datasets that may be overly simplified. This is potentially reducing the effectiveness of the dataset for thorough evaluation and failing to accurately represent tables as they appear in the real-world. To enrich the existing benchmark datasets, we extract and annotate a new, more challenging dataset. The proposed Wiki-TabNER dataset features complex tables containing several entities per cell, with named entities labeled using DBpedia classes. This dataset is specifically designed to address named entity recognition (NER) task within tables, but it can also be used as a more challenging dataset for evaluating the entity linking task. In this paper we describe the distinguishing features of the Wiki-TabNER dataset and the labeling process. In addition, we propose a prompting framework for evaluating the new large language models on the within tables NER task. Finally, we perform qualitative analysis to gain insights into the challenges encountered by the models and to understand the limitations of the proposed~dataset.
Related papers
- TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models [30.26407735827857]
Reasoning with table-structured data poses significant challenges for large language models (LLMs)<n>We present a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities.<n>We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes, TCoT, PoT and ICoT.
arXiv Detail & Related papers (2025-06-23T09:02:04Z) - Bridging Queries and Tables through Entities in Table Retrieval [70.13748256886288]
Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval.<n>We propose an entity-enhanced training framework and design an interaction paradigm based on entity representations.<n>Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes.
arXiv Detail & Related papers (2025-04-09T03:16:33Z) - TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding.<n>TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs.<n>Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z) - Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z) - TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z) - QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs [63.98556480088152]
Table summarization is a crucial task aimed at condensing information into concise and comprehensible textual summaries.
We propose a novel method to address these limitations by introducing query-focused multi-table summarization.
Our approach, which comprises a table serialization module, a summarization controller, and a large language model, generates query-dependent table summaries tailored to users' information needs.
arXiv Detail & Related papers (2024-05-08T15:05:55Z) - Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [17.910306140400046]
This research endeavors to apply Large Language Models (LLMs) towards addressing these predictive tasks.
Our research aims to mitigate this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2.
arXiv Detail & Related papers (2024-03-29T14:41:21Z) - Text2Analysis: A Benchmark of Table Question Answering with Advanced
Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z) - Observatory: Characterizing Embeddings of Relational Tables [15.808819332614712]
Researchers and practitioners are keen to leverage language and table embedding models in many new application contexts.
There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage.
We propose Observatory, a formal framework to systematically analyze embedding representations of relational tables.
arXiv Detail & Related papers (2023-10-05T00:58:45Z) - Schema-Driven Information Extraction from Heterogeneous Tables [37.50854811537401]
We present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages.
Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels.
arXiv Detail & Related papers (2023-05-23T17:58:10Z) - QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z) - CTE: A Dataset for Contextualized Table Extraction [1.1859913430860336]
The dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables.
Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets.
The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition, and functional analysis.
arXiv Detail & Related papers (2023-02-02T22:38:23Z) - HiTab: A Hierarchical Table Dataset for Question Answering and Natural
Language Generation [35.73434495391091]
Hierarchical tables challenge existing methods by hierarchical indexing, as well as implicit relationships of calculation and semantics.
This work presents HiTab, a free and open dataset for the research community to study question answering (QA) and natural language generation (NLG) over hierarchical tables.
arXiv Detail & Related papers (2021-08-15T10:14:21Z) - A Graph Representation of Semi-structured Data for Web Question
Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z) - GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-pairs over high-free tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z) - Eliminating Catastrophic Interference with Biased Competition [0.0]
We present a model to take advantage of the multi-task nature of complex datasets by learning to separate tasks and subtasks in and end to end manner by biasing competitive interactions in the network.
We demonstrate that this model eliminates catastrophic interference between tasks on a newly created dataset and provides competitive results in the Visual Question Answering space.
arXiv Detail & Related papers (2020-07-03T16:15:15Z) - A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z) - Leveraging Schema Labels to Enhance Dataset Search [20.63182827636973]
We propose a novel schema label generation model which generates possible schema labels based on dataset table content.
We incorporate the generated schema labels into a mixed ranking model which considers the relevance between the query and dataset metadata.
Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task.
arXiv Detail & Related papers (2020-01-27T22:41:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.