Addressing Issues of Cross-Linguality in Open-Retrieval Question
Answering Systems For Emergent Domains
- URL: http://arxiv.org/abs/2201.11153v1
- Date: Wed, 26 Jan 2022 19:27:32 GMT
- Authors: Alon Albalak, Sharon Levy, and William Yang Wang
- Abstract summary: We demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19.
Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable.
We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-retrieval question answering systems are generally trained and tested on
large datasets in well-established domains. However, low-resource settings such
as new and emerging domains would especially benefit from reliable question
answering systems. Furthermore, multilingual and cross-lingual resources in
emergent domains are scarce, leading to few or no such systems. In this paper,
we demonstrate a cross-lingual open-retrieval question answering system for the
emergent domain of COVID-19. Our system adopts a corpus of scientific articles
to ensure that retrieved documents are reliable. To address the scarcity of
cross-lingual training data in emergent domains, we present a method utilizing
automatic translation, alignment, and filtering to produce English-to-all
datasets. We show that a deep semantic retriever greatly benefits from training
on our English-to-all data and significantly outperforms a BM25 baseline in the
cross-lingual setting. We illustrate the capabilities of our system with
examples and release all code necessary to train and deploy such a system.
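The abstract's claim that the dense retriever "significantly outperforms a BM25 baseline" can be made concrete with a minimal, self-contained BM25 scorer. This is an illustration only, not the authors' implementation: the parameter defaults k1 = 1.5 and b = 0.75 and the whitespace tokenizer are common conventions assumed here, and the example documents are invented.

```python
# Minimal BM25 sketch of the lexical baseline mentioned in the abstract.
# k1=1.5 and b=0.75 are common defaults, not values from the paper;
# tokenization is naive lowercase whitespace splitting.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document in `docs` for `query`."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

docs = [
    "covid-19 vaccines induce an antibody response",
    "stock markets fell sharply on monday",
    "antibody levels after covid-19 infection decline over time",
]
print(bm25_scores("covid-19 antibody response", docs))
```

A deep semantic retriever replaces these sparse term-overlap statistics with learned embedding similarity, which is what lets it match a question in one language against passages in another, where BM25 finds no shared tokens at all.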
Related papers
- Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
- DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporated either knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z)
- ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System [16.89747171947662]
This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA).
In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate an answer in the language of the question.
We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation.
arXiv Detail & Related papers (2022-05-30T10:31:08Z)
- Design and Development of Rule-based open-domain Question-Answering System on SQuAD v2.0 Dataset [0.0]
We propose a rule-based open-domain question-answering system capable of answering questions from any domain given a corresponding context passage.
We tested the system on 1,000 questions from the SQuAD 2.0 dataset, with satisfactory results.
arXiv Detail & Related papers (2022-03-27T07:51:18Z)
- Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval [19.000263567641817]
We present Multi-CPR, a novel multi-domain Chinese dataset for passage retrieval.
The dataset is collected from three different domains: e-commerce, entertainment video, and medical.
We find that the performance of retrieval models trained on general-domain data inevitably decreases on specific domains.
arXiv Detail & Related papers (2022-03-07T13:20:46Z)
- Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
- Towards More Equitable Question Answering Systems: How Much More Data Do You Need? [15.401330338654203]
We take a step back and study which approaches allow us to take the most advantage of existing resources in order to produce QA systems in many languages.
Specifically, we perform extensive analysis to measure the efficacy of few-shot approaches augmented with automatic translations and permutations of context-question-answer pairs.
We make suggestions for future dataset development efforts that make better use of a fixed annotation budget, with a goal of increasing the language coverage of QA datasets and systems.
arXiv Detail & Related papers (2021-05-28T21:32:04Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, benchmarking the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.