Multilingual Compositional Wikidata Questions
- URL: http://arxiv.org/abs/2108.03509v1
- Date: Sat, 7 Aug 2021 19:40:38 GMT
- Title: Multilingual Compositional Wikidata Questions
- Authors: Ruixiang Cui, Rahul Aralikatte, Heather Lent, Daniel Hershcovich
- Abstract summary: We propose a method for creating a multilingual, parallel dataset of question-query pairs grounded in Wikidata.
We use this data to train semantic parsers for Hebrew, Kannada, Chinese and English to better understand the current strengths and weaknesses of multilingual semantic parsing.
- Score: 9.602430657819564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic parsing allows humans to leverage vast knowledge resources through
natural interaction. However, parsers are mostly designed for and evaluated on
English resources, such as CFQ (Keysers et al., 2020), the current standard
benchmark based on English data generated from grammar rules and oriented
towards Freebase, an outdated knowledge base. We propose a method for creating
a multilingual, parallel dataset of question-query pairs, grounded in Wikidata,
and introduce such a dataset called Compositional Wikidata Questions (CWQ). We
utilize this data to train and evaluate semantic parsers for Hebrew, Kannada,
Chinese and English, to better understand the current strengths and weaknesses
of multilingual semantic parsing. Experiments on zero-shot cross-lingual
transfer demonstrate that models fail to generate valid queries even with
pretrained multilingual encoders. Our methodology, dataset and results will
facilitate future research on semantic parsing in more realistic and diverse
settings than has been possible with existing resources.
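As a rough illustration of what a parallel question-query pair grounded in Wikidata could look like, here is a minimal Python sketch. The `ParallelExample` schema, the sample questions and the whitespace-normalized exact-match check are all assumptions made for illustration; they are not taken from the CWQ dataset or the paper's code. The Wikidata identifiers used (P31 instance of, P57 director, P27 country of citizenship, Q11424 film, Q30 United States) are real.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ParallelExample:
    """One query shared by questions in several languages (hypothetical schema)."""
    questions: Dict[str, str]  # language code -> question text
    sparql: str                # query over Wikidata


def normalize(query: str) -> str:
    """Collapse whitespace so formatting differences do not affect matching."""
    return " ".join(query.split())


def exact_match(predicted: str, gold: str) -> bool:
    """Strict string comparison after whitespace normalization (assumed metric)."""
    return normalize(predicted) == normalize(gold)


# Illustrative pair written for this sketch, not a dataset entry.
example = ParallelExample(
    questions={
        "en": "Which films were directed by a citizen of the United States?",
        "zh": "哪些电影是由美国公民执导的？",
    },
    sparql=(
        "SELECT DISTINCT ?x0 WHERE { "
        "?x0 wdt:P31 wd:Q11424 . "  # ?x0 is an instance of film
        "?x0 wdt:P57 ?x1 . "        # ?x0 has director ?x1
        "?x1 wdt:P27 wd:Q30 }"      # ?x1 is a citizen of the United States
    ),
)

if __name__ == "__main__":
    prediction = """SELECT DISTINCT ?x0
    WHERE { ?x0 wdt:P31 wd:Q11424 . ?x0 wdt:P57 ?x1 . ?x1 wdt:P27 wd:Q30 }"""
    print(exact_match(prediction, example.sparql))
```

Because the query side is shared across languages, a parser's output for any of the questions can be scored against the same gold query.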
Related papers
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to the results of executing those parses.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
- Meta-Learning a Cross-lingual Manifold for Semantic Parsing [75.26271012018861]
Localizing a semantic parser to support new languages requires effective cross-lingual generalization.
We introduce a first-order meta-learning algorithm to train a semantic parser with maximal sample efficiency during cross-lingual transfer.
Results across six languages on ATIS demonstrate that our combination of steps yields accurate semantic parsers sampling $\le$10% of source training data in each new language.
arXiv Detail & Related papers (2022-09-26T10:42:17Z)
- A Chinese Multi-type Complex Questions Answering Dataset over Wikidata [45.31495982252219]
Complex Knowledge Base Question Answering has been a popular area of research over the past decade.
Recent public datasets have led to encouraging results in this field, but are mostly limited to English.
Few state-of-the-art KBQA models are trained on Wikidata, one of the most popular real-world knowledge bases.
We propose CLC-QuAD, the first large scale complex Chinese semantic parsing dataset over Wikidata to address these challenges.
arXiv Detail & Related papers (2021-11-11T07:39:16Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification [0.0]
Current approaches for relation classification are mainly focused on the English language.
We propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup.
For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish.
arXiv Detail & Related papers (2020-10-19T11:08:16Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)