Table Question Answering for Low-resourced Indic Languages
- URL: http://arxiv.org/abs/2410.03576v1
- Date: Fri, 4 Oct 2024 16:26:12 GMT
- Title: Table Question Answering for Low-resourced Indic Languages
- Authors: Vaishali Pal, Evangelos Kanoulas, Andrew Yates, Maarten de Rijke
- Abstract summary: TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output.
We introduce a fully automatic, large-scale tableQA data generation process for low-resource languages with a limited budget.
We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output. TableQA research has focused primarily on high-resource languages, leaving medium- and low-resource languages with little progress due to the scarcity of annotated data and neural models. We address this gap by introducing a fully automatic, large-scale tableQA data generation process for low-resource languages with a limited budget. We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models. TableQA models trained on our large-scale datasets outperform state-of-the-art LLMs. We further study the trained models on different aspects, including mathematical reasoning capabilities and zero-shot cross-lingual transfer. Our work is the first on low-resource tableQA focusing on scalable data generation and evaluation procedures. Our proposed data generation method can be applied to any low-resource language with a web presence. We release datasets, models, and code (https://github.com/kolk/Low-Resource-TableQA-Indic-languages).
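The abstract does not detail the generation pipeline here, so the following is only a minimal sketch of one plausible translation-based generation step: an English tableQA example is machine-translated into the target language and kept only if a round-trip quality check passes. The `translate` function is a hypothetical placeholder for any MT system, not the authors' code.

```python
# Hypothetical sketch of translation-based tableQA data generation.
# `translate` is a placeholder for a real MT system; the round-trip
# check discards examples whose questions translate poorly.
from difflib import SequenceMatcher

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder MT call; swap in a real translation model."""
    return text  # identity stand-in so the sketch runs end to end

def round_trip_score(text: str, tgt: str, src: str = "en") -> float:
    """Similarity between the original text and its back-translation."""
    back = translate(translate(text, src, tgt), tgt, src)
    return SequenceMatcher(None, text.lower(), back.lower()).ratio()

def generate_example(example: dict, tgt: str, threshold: float = 0.8):
    """Translate question, table, and answer; drop noisy translations."""
    if round_trip_score(example["question"], tgt) < threshold:
        return None
    return {
        "question": translate(example["question"], "en", tgt),
        "header": [translate(h, "en", tgt) for h in example["header"]],
        "rows": [[translate(c, "en", tgt) for c in r] for r in example["rows"]],
        "answer": translate(example["answer"], "en", tgt),
    }

en_example = {
    "question": "Which city has the highest population?",
    "header": ["City", "Population"],
    "rows": [["Mumbai", "20411000"], ["Delhi", "16787941"]],
    "answer": "Mumbai",
}
print(generate_example(en_example, tgt="bn"))
```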
Related papers
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages
Indic-QA is the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families.
We generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage; these pairs are then manually verified for quality assurance.
We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages.
arXiv Detail & Related papers (2024-07-18T13:57:16Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that, although our proposed framework is competitive with weak baselines on MoSECroT, it falls short of strong baselines.
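For intuition, relative representations re-encode each embedding by its similarities to a shared set of anchor points, so a source-language PLM space and a target-language static space become directly comparable despite different dimensionalities. A minimal sketch, with toy random vectors standing in for the real embeddings and an assumed set of bilingual anchor pairs:

```python
# Sketch of relative representations: re-encode vectors from two spaces
# of different dimensionality as cosine similarities to anchor pairs.
# Toy random vectors stand in for real PLM / static word embeddings.
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_anchors = 768, 300, 8

# Assumed bilingual anchors: embeddings of translation pairs in each space.
src_anchors = rng.normal(size=(n_anchors, d_src))
tgt_anchors = rng.normal(size=(n_anchors, d_tgt))

def relative(x: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Cosine similarity of x to each anchor row: its relative representation."""
    x = x / np.linalg.norm(x)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return a @ x

src_word = rng.normal(size=d_src)  # e.g., a source-language PLM embedding
tgt_word = rng.normal(size=d_tgt)  # e.g., a target-language static embedding

# Both now live in the same n_anchors-dimensional space and can be compared.
r_s, r_t = relative(src_word, src_anchors), relative(tgt_word, tgt_anchors)
print(float(r_s @ r_t / (np.linalg.norm(r_s) * np.linalg.norm(r_t))))
```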
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- HeLM: Highlighted Evidence augmented Language Model for Enhanced Table-to-Text Generation
We conduct parameter-efficient fine-tuning of the LLaMA2 model.
Our approach injects reasoning information into the input by emphasizing table-specific row data.
On both the FetaQA and QTSumm datasets, our approach achieves state-of-the-art results.
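One way to picture "emphasizing table-specific row data" is to tag evidence rows in the linearized table that the generator consumes; the markers and row selection below are illustrative, not HeLM's exact input format.

```python
# Illustrative sketch: linearize a table and highlight evidence rows so a
# seq2seq generator can attend to them. Marker format is an assumption.
def linearize(table: dict, highlight: set) -> str:
    parts = ["col: " + " | ".join(table["header"])]
    for i, row in enumerate(table["rows"]):
        cells = " | ".join(row)
        tag = f"<hl> {cells} </hl>" if i in highlight else cells
        parts.append(f"row {i}: {tag}")
    return " ".join(parts)

table = {"header": ["Team", "Wins"], "rows": [["A", "10"], ["B", "7"]]}
print(linearize(table, highlight={0}))
# -> col: Team | Wins row 0: <hl> A | 10 </hl> row 1: B | 7
```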
arXiv Detail & Related papers (2023-11-15T12:02:52Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation
In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval.
By separating cross-lingual knowledge from the knowledge of query-document matching, OPTICAL needs only bitext data for distillation training.
Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
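The core object in optimal-transport distillation is a soft alignment between the token embeddings of a bitext pair. The sketch below computes such a transport plan with entropy-regularized Sinkhorn iterations on toy vectors; how OPTICAL turns the plan into a distillation signal is omitted here.

```python
# Sketch: entropy-regularized optimal transport (Sinkhorn) between the
# token embeddings of a bitext pair, yielding a soft word alignment.
import numpy as np

def sinkhorn(cost: np.ndarray, reg: float = 0.1, n_iters: int = 200):
    """Transport plan between two uniform distributions over tokens."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    v = np.full(m, 1.0 / m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

rng = np.random.default_rng(0)
src = rng.normal(size=(4, 64))  # e.g., source-sentence token embeddings
tgt = rng.normal(size=(5, 64))  # e.g., its low-resource translation

s = src / np.linalg.norm(src, axis=1, keepdims=True)
t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
plan = sinkhorn(1.0 - s @ t.T)  # cosine distance as cost
print(plan.sum(axis=1))         # rows sum to ~1/4: valid marginals
```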
arXiv Detail & Related papers (2023-01-29T22:30:36Z)
- QAmeleon: Multilingual QA with Only 5 Examples
We show how to leverage pre-trained language models in a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
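QAmeleon prompt-tunes the PLM rather than using plain text prompts, but the data-synthesis loop can be approximated with a five-shot prompt; `call_plm` below is a hypothetical placeholder, not the paper's model.

```python
# Hypothetical sketch of few-shot synthetic QA generation. QAmeleon uses
# prompt tuning; a plain five-shot text prompt is shown for illustration.
def call_plm(prompt: str) -> str:
    """Placeholder for a PLM call; swap in a real model endpoint."""
    return "Q: ... A: ..."

def build_prompt(examples: list, passage: str) -> str:
    shots = "\n\n".join(
        f"Passage: {e['passage']}\nQ: {e['question']}\nA: {e['answer']}"
        for e in examples[:5]  # only five examples per language
    )
    return f"{shots}\n\nPassage: {passage}\nQ:"

seed = [{"passage": "Paris is the capital of France.",
         "question": "What is the capital of France?",
         "answer": "Paris"}] * 5
synthetic = call_plm(build_prompt(seed, "The Ganges flows through India."))
print(synthetic)  # parsed into QA pairs that train the downstream QA model
```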
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
- Cross-lingual alignments of ELMo contextual embeddings
Cross-lingual embeddings map word embeddings from a low-resource language to a high-resource language.
To produce cross-lingual mappings of recent contextual embeddings, the anchor points between the embedding spaces have to be words appearing in the same context.
We propose novel cross-lingual mapping methods for ELMo embeddings.
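As background, the classic baseline for such mappings is orthogonal Procrustes over anchor pairs, solved in closed form with an SVD; the paper's ELMo-specific methods refine this idea for contextual embeddings.

```python
# Sketch of the orthogonal Procrustes baseline for cross-lingual mapping:
# find an orthogonal W minimizing ||XW - Y|| over anchor embedding pairs.
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 128, 200
X = rng.normal(size=(n_pairs, d))        # low-resource anchor embeddings
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
Y = X @ W_true                           # their high-resource counterparts

U, _, Vt = np.linalg.svd(X.T @ Y)        # closed-form Procrustes solution
W = U @ Vt
print(np.allclose(X @ W, Y, atol=1e-6))  # True: mapping recovered
```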
arXiv Detail & Related papers (2021-06-30T11:26:43Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in one resource-rich language, e.g., English, to other, less resource-rich languages.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to the models' limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
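A common ingredient of such adaptation is growing the embedding matrix with rows for new-script tokens and training only those rows at first; the mean-plus-noise initialization below is a generic heuristic, not the paper's exact method.

```python
# Generic sketch: extend a pretrained embedding matrix with rows for
# tokens of an unseen script, initialized near the mean embedding.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, n_new = 1000, 256, 50
E = rng.normal(size=(vocab_size, d))  # stands in for pretrained embeddings

mean = E.mean(axis=0)
noise = rng.normal(scale=0.01, size=(n_new, d))
E_ext = np.vstack([E, mean + noise])  # new rows for new-script tokens

print(E_ext.shape)  # (1050, 256); only the 50 new rows need training at first
```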
arXiv Detail & Related papers (2020-12-31T11:37:28Z)