One Question Answering Model for Many Languages with Cross-lingual Dense
Passage Retrieval
- URL: http://arxiv.org/abs/2107.11976v1
- Date: Mon, 26 Jul 2021 06:02:54 GMT
- Title: One Question Answering Model for Many Languages with Cross-lingual Dense
Passage Retrieval
- Authors: Akari Asai, Xinyan Yu, Jungo Kasai, Hannaneh Hajishirzi
- Abstract summary: CORA is a Cross-lingual Open-Retrieval Answer Generation model.
It can answer questions across many languages even when language-specific annotated data or knowledge sources are unavailable.
- Score: 39.061900747689094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present CORA, a Cross-lingual Open-Retrieval Answer Generation model that
can answer questions across many languages even when language-specific
annotated data or knowledge sources are unavailable. We introduce a new dense
passage retrieval algorithm that is trained to retrieve documents across
languages for a question. Combined with a multilingual autoregressive
generation model, CORA answers directly in the target language without any
translation or in-language retrieval modules as used in prior work. We propose
an iterative training method that automatically extends annotated data
available only in high-resource languages to low-resource ones. Our results
show that CORA substantially outperforms the previous state of the art on
multilingual open question answering benchmarks across 26 languages, 9 of which
are unseen during training. Our analyses show the significance of cross-lingual
retrieval and generation in many languages, particularly under low-resource
settings.
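At the heart of CORA is retrieval in a shared multilingual embedding space: questions and passages are encoded into dense vectors, and passages are ranked by inner product regardless of their language. The sketch below illustrates only this scoring idea; the hash-based encoder, dimensionality, and example pool are stand-in assumptions, not CORA's trained multilingual transformer or its released code.

```python
import numpy as np

DIM = 128  # embedding width; illustrative, not CORA's actual size


def encode(text: str) -> np.ndarray:
    """Stand-in multilingual encoder: hashes characters into a unit vector.

    CORA trains a multilingual transformer for this role; the only property
    mimicked here is that text in any language lands in one shared space.
    """
    vec = np.zeros(DIM)
    for i, ch in enumerate(text):
        vec[(ord(ch) + i) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank a mixed-language passage pool by inner product with the question."""
    q = encode(question)
    scores = np.stack([encode(p) for p in passages]) @ q
    return [passages[i] for i in np.argsort(-scores)[:k]]


# Evidence in any language is scored directly; no translation module is involved.
pool = [
    "Mount Fuji is the highest mountain in Japan.",
    "富士山は日本で一番高い山です。",
    "La tour Eiffel se trouve à Paris.",
]
print(retrieve("What is the highest mountain in Japan?", pool))
```

The retrieved passages, still untranslated, would then be passed to a multilingual autoregressive generator that writes the answer directly in the question's language.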
Related papers
- What are the limits of cross-lingual dense passage retrieval for low-resource languages? [23.88853455670863]
We analyze the capabilities of the multilingual Dense Passage Retriever (mDPR) for extremely low-resource languages.
mDPR achieves success on multilingual open QA benchmarks across 26 languages, of which 9 were unseen during training.
We focus on two extremely low-resource languages for which mDPR performs poorly: Amharic and Khmer.
arXiv Detail & Related papers (2024-08-21T18:51:46Z)
- Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models [22.859955360764275]
We introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test to assess a model's ability to retrieve a relevant piece of information (the needle) from a long multilingual distractor context (the haystack).
We evaluate four state-of-the-art large language models on MLNeedle.
arXiv Detail & Related papers (2024-08-19T17:02:06Z)
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for conditionally encoding instances.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System [16.89747171947662]
This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA).
In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate an answer in the language of the question.
We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation.
arXiv Detail & Related papers (2022-05-30T10:31:08Z)
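The three components named above suggest a simple pipeline shape: augmentation expands training data offline, while retrieval and generation form the inference path. A minimal skeleton under that reading (the callable signatures are hypothetical, not the shared-task system's actual interfaces):

```python
from typing import Callable

# Hypothetical stage signatures; the actual system ensembles several model
# variants per component.
Augment = Callable[[list[dict]], list[dict]]   # training-time: expand QA data
Retrieve = Callable[[str], list[str]]          # question -> multilingual evidence
Generate = Callable[[str, list[str]], str]     # question + evidence -> answer


def answer(question: str, retrieve: Retrieve, generate: Generate) -> str:
    """Inference path: gather evidence from a multilingual pool, then answer
    in the language of the question (the generator is assumed multilingual)."""
    return generate(question, retrieve(question))
```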
- From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155]
We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect.
We propose a joint learning approach that combines English SLU training data with non-English auxiliary tasks drawn from raw text, syntax, and translation for transfer.
Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
arXiv Detail & Related papers (2021-05-15T23:51:11Z)
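Joint learning here amounts to optimizing the main SLU objectives together with weighted auxiliary objectives computed on non-English data. A schematic of the combined loss (the weighting scheme and values are assumptions; the paper's exact configuration may differ):

```python
def joint_loss(slot_loss: float, intent_loss: float,
               aux_losses: dict[str, float],
               aux_weights: dict[str, float]) -> float:
    """Main SLU losses (slots + intents, English data) plus weighted auxiliary
    losses (e.g., masked language modeling or machine translation on
    non-English text). Per the paper's findings, MLM mainly helps slot
    filling while MT transfer helps intent classification most."""
    return slot_loss + intent_loss + sum(
        aux_weights[name] * loss for name, loss in aux_losses.items()
    )


# Example step with hypothetical loss values and equal auxiliary weights.
total = joint_loss(0.9, 0.4, {"mlm": 1.2, "mt": 0.7}, {"mlm": 0.5, "mt": 0.5})
```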
- Pivot Through English: Reliably Answering Multilingual Questions without Document Retrieval [4.4973334555746]
Existing methods for open-retrieval question answering in lower-resource languages (LRLs) lag significantly behind English.
We formulate a task setup that is more realistic given available resources: it circumvents document retrieval in order to reliably transfer knowledge from English to lower-resource languages.
Within this task setup, we propose Reranked Maximal Inner Product Search (RM-MIPS), akin to semantic similarity retrieval over the English training set with reranking.
arXiv Detail & Related papers (2020-12-28T04:38:45Z)
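RM-MIPS, as described, replaces document retrieval with a two-stage search over the English training set: maximal inner product search proposes the k most similar training questions, and a reranker picks the best match, whose answer is returned. A minimal numpy sketch under that reading (rerank_fn stands in for the paper's semantic reranking model and is a hypothetical argument):

```python
import numpy as np


def rm_mips(query_vec: np.ndarray,
            train_question_vecs: np.ndarray,  # one row per English training question
            train_answers: list[str],
            rerank_fn,                        # hypothetical reranker: (q, cand) -> score
            k: int = 10) -> str:
    """Stage 1: MIPS shortlists the k nearest English training questions.
    Stage 2: the reranker rescores the shortlist; the winner's answer is returned."""
    scores = train_question_vecs @ query_vec        # inner-product search
    shortlist = np.argsort(-scores)[:k]
    best = max(shortlist, key=lambda i: rerank_fn(query_vec, train_question_vecs[i]))
    return train_answers[best]
```

The multilingual query is assumed to be embedded in the same space as the English questions, which is what lets knowledge transfer without retrieving any documents.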
- XOR QA: Cross-lingual Open-Retrieval Question Answering [75.20578121267411]
This work extends open-retrieval question answering to a cross-lingual setting.
We construct a large-scale dataset built on questions lacking same-language answers.
arXiv Detail & Related papers (2020-10-22T16:47:17Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
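A cloze-style probe masks one slot of a factual statement and asks a pretrained LM to fill it, repeated across languages. A minimal illustration, assuming the HuggingFace transformers library and the public multilingual BERT checkpoint; the prompts are made up for illustration, not drawn from the X-FACTR benchmark:

```python
# Assumes: pip install transformers torch; the model downloads on first run.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# The same fact probed in two languages; a multilingual LM should recover
# the masked object in both.
for prompt in [
    "Paris is the capital of [MASK].",
    "Paris est la capitale de la [MASK].",
]:
    top = fill(prompt)[0]  # highest-scoring completion
    print(prompt, "->", top["token_str"], f"({top['score']:.2f})")
```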
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.