Related papers: Towards More Equitable Question Answering Systems: How Much More Data Do You Need?

Towards More Equitable Question Answering Systems: How Much More Data Do You Need?

URL: http://arxiv.org/abs/2105.14115v1
Date: Fri, 28 May 2021 21:32:04 GMT
Title: Towards More Equitable Question Answering Systems: How Much More Data Do You Need?
Authors: Arnab Debnath, Navid Rajabi, Fardina Fathmiul Alam, Antonios Anastasopoulos
Abstract summary: We take a step back and study which approaches allow us to take the most advantage of existing resources in order to produce QA systems in many languages. Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs. We make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with a goal of increasing the language coverage of QA datasets and systems.
Score: 15.401330338654203
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Question answering (QA) in English has been widely explored, but multilingual datasets are relatively new, with several methods attempting to bridge the gap between high- and low-resourced languages using data augmentation through translation and cross-lingual transfer. In this project, we take a step back and study which approaches allow us to take the most advantage of existing resources in order to produce QA systems in many languages. Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs. In addition, we make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with a goal of increasing the language coverage of QA datasets and systems. Code and data for reproducing our experiments are available here: https://github.com/NavidRajabi/EMQA.

Related papers

Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs [0.19116784879310025]
Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query.<n>Recent advances have shifted from translation-based methods toward embedding-based approaches.<n>This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques.
arXiv Detail & Related papers (2025-10-01T13:50:05Z)
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models [52.22235443948351]
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs)<n>Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale.<n>JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings.
arXiv Detail & Related papers (2025-05-28T11:06:54Z)
Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task [73.35882908048423]
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering.
arXiv Detail & Related papers (2025-04-04T17:35:43Z)
Cross-lingual Transfer for Automatic Question Generation by Learning Interrogative Structures in Target Languages [6.635572580071933]
We propose a simple and efficient XLT-QG method that operates without the need for monolingual, parallel, or labeled data in the target language. Our method achieves performance comparable to GPT-3.5-turbo across different languages.
arXiv Detail & Related papers (2024-10-04T07:29:35Z)
MST5 -- Multilingual Question Answering over Knowledge Graphs [1.6470999044938401]
Knowledge Graph Question Answering (KGQA) simplifies querying vast amounts of knowledge stored in a graph-based model using natural language. Existing multilingual KGQA systems face challenges in achieving performance comparable to English systems. We propose a simplified approach to enhance multilingual KGQA systems by incorporating linguistic context and entity information directly into the processing pipeline of a language model.
arXiv Detail & Related papers (2024-07-08T15:37:51Z)
Datasets for Multilingual Answer Sentence Selection [59.28492975191415]
We introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish) Results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models.
arXiv Detail & Related papers (2024-06-14T16:50:29Z)
mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans [27.84922167294656]
It is challenging to curate a dataset for language-specific knowledge and common sense. Most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. We propose Multilingual CommonsenseQA (mCSQA) based on the construction process of CSQA but leveraging language models for a more efficient construction.
arXiv Detail & Related papers (2024-06-06T16:14:54Z)
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA) We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering. We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance. This is due to the heterogeneous and imbalanced nature of multilingual collections. We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models under a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained. Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems. The main idea is to transfer data, created from one resource rich language, e.g., English, to other languages, less rich in terms of resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
Multilingual Transfer Learning for QA Using Translation as Data Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space. We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance. Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
arXiv Detail & Related papers (2020-12-10T20:29:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.