XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages
and Meaning Representations
- URL: http://arxiv.org/abs/2306.04085v1
- Date: Wed, 7 Jun 2023 01:09:37 GMT
- Title: XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages
and Meaning Representations
- Authors: Yusen Zhang, Jun Wang, Zhiguo Wang, Rui Zhang
- Abstract summary: Cross-Lingual Semantic Parsing aims to translate queries in multiple natural languages into meaning representations.
Existing CLSP models are separately proposed and evaluated on datasets of limited tasks and applications.
We present XSemPLR, a unified benchmark for cross-lingual semantic parsing featuring 22 natural languages and 8 meaning representations.
- Score: 25.50509874992198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-Lingual Semantic Parsing (CLSP) aims to translate queries in multiple
natural languages (NLs) into meaning representations (MRs) such as SQL, lambda
calculus, and logic forms. However, existing CLSP models are separately
proposed and evaluated on datasets of limited tasks and applications, impeding
a comprehensive and unified evaluation of CLSP on a diverse range of NLs and
MRs. To this end, we present XSemPLR, a unified benchmark for cross-lingual
semantic parsing featuring 22 natural languages and 8 meaning
representations by examining and selecting 9 existing datasets to cover 5 tasks
and 164 domains. We use XSemPLR to conduct a comprehensive benchmark study on a
wide range of multilingual language models including encoder-based models
(mBERT, XLM-R), encoder-decoder models (mBART, mT5), and decoder-based models
(Codex, BLOOM). We design 6 experiment settings covering various lingual
combinations (monolingual, multilingual, cross-lingual) and numbers of learning
samples (full dataset, few-shot, and zero-shot). Our experiments show that
encoder-decoder models (mT5) achieve the highest performance compared with
other popular models, and multilingual training can further improve the average
performance. Notably, multilingual large language models (e.g., BLOOM) are
still inadequate to perform CLSP tasks. We also find that the performance gap
between monolingual training and cross-lingual transfer learning is still
significant for multilingual models, though it can be mitigated by
cross-lingual few-shot training. Our dataset and code are available at
https://github.com/psunlpgroup/XSemPLR.
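As a rough illustration of the encoder-decoder setting that performs best in these experiments, the sketch below fine-tunes mT5 on (natural-language question, SQL) pairs with Hugging Face Transformers. The `google/mt5-base` checkpoint, the toy training pair, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' pipeline): fine-tune mT5 to map questions
# in any natural language to SQL, as in the text-to-SQL portion of XSemPLR.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Hypothetical training pair: (question in some natural language, target SQL).
pairs = [
    ("¿Cuántos vuelos salen de Madrid?",
     "SELECT COUNT(*) FROM flights WHERE origin = 'Madrid'"),
]

model.train()
for question, sql in pairs:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    labels = tokenizer(sql, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: generate the meaning representation for an unseen question,
# possibly in a different language than the training data (cross-lingual use).
model.eval()
query = tokenizer("How many flights leave Madrid?", return_tensors="pt")
print(tokenizer.decode(model.generate(**query, max_length=128)[0],
                       skip_special_tokens=True))
```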
Related papers
- MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB).
MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages.
We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
- LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision.
LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks.
We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
- Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction within the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
We evaluate the performance of five Large Language Models (LLMs) and eight prompts for the identification of cross-lingual code clones; one plausible prompt shape is sketched after this entry.
arXiv Detail & Related papers (2024-08-08T12:57:14Z)
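To make the prompt-based setup above concrete, here is a hypothetical sketch of one way a cross-language clone-detection prompt could be built; the wording and workflow are assumptions, not one of the paper's eight prompts or five models.

```python
# Hypothetical prompt construction for cross-language clone detection with an
# LLM; the resulting prompt would be sent to any chat-style model endpoint and
# the Yes/No answer scored against gold clone labels.
def build_clone_prompt(snippet_a: str, lang_a: str,
                       snippet_b: str, lang_b: str) -> str:
    return (
        f"Snippet 1 ({lang_a}):\n{snippet_a}\n\n"
        f"Snippet 2 ({lang_b}):\n{snippet_b}\n\n"
        "Do these two snippets implement the same functionality? Answer Yes or No."
    )

java_src = "int add(int a, int b) { return a + b; }"
python_src = "def add(a, b):\n    return a + b"
print(build_clone_prompt(java_src, "Java", python_src, "Python"))
```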
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification [16.684856745734944]
We present a multilingual bag-of-entities model that boosts the performance of zero-shot cross-lingual text classification.
It leverages the multilingual nature of Wikidata: entities in multiple languages representing the same concept are defined with a unique identifier.
A model trained on entity features in a resource-rich language can thus be directly applied to other languages.
arXiv Detail & Related papers (2021-10-15T01:10:50Z)
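A minimal sketch of the bag-of-entities idea described in the entry above, assuming toy mention-to-QID dictionaries; a real system would use a full Wikidata entity linker rather than hard-coded mentions.

```python
# Mentions in different languages map to the same language-independent
# Wikidata ID (QID), so a classifier trained on QID counts in one language
# can be applied directly to texts in another language.
from collections import Counter

MENTION_TO_QID = {  # toy dictionaries; real systems use an entity linker
    "en": {"Berlin": "Q64", "Germany": "Q183"},
    "de": {"Berlin": "Q64", "Deutschland": "Q183"},
}

def bag_of_entities(text: str, lang: str) -> Counter:
    """Count the Wikidata QIDs whose surface forms appear in the text."""
    counts = Counter()
    for mention, qid in MENTION_TO_QID[lang].items():
        if mention in text:
            counts[qid] += 1
    return counts

# Both sentences yield the same language-independent feature vector,
# which is what enables zero-shot cross-lingual transfer.
print(bag_of_entities("Berlin is the capital of Germany", "en"))
print(bag_of_entities("Berlin ist die Hauptstadt von Deutschland", "de"))
```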
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
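A minimal sketch, in the spirit of this entry, of dictionary-based code-switching augmentation; the toy bilingual dictionaries and the 0.3 replacement rate are assumptions, not the method's exact settings.

```python
# Randomly replace source-language words with translations drawn from
# bilingual dictionaries, producing mixed-language sentences that can be
# used to fine-tune a multilingual encoder such as mBERT.
import random

BILINGUAL_DICTS = {  # toy dictionaries for illustration
    "de": {"book": "Buch", "cheap": "günstig", "flight": "Flug"},
    "es": {"book": "reservar", "cheap": "barato", "flight": "vuelo"},
}

def code_switch(sentence: str, replace_prob: float = 0.3) -> str:
    out = []
    for word in sentence.split():
        if random.random() < replace_prob:
            lang = random.choice(list(BILINGUAL_DICTS))
            out.append(BILINGUAL_DICTS[lang].get(word.lower(), word))
        else:
            out.append(word)
    return " ".join(out)

# e.g. "book a cheap Flug to Berlin" -- the augmented sentences are then
# mixed into ordinary fine-tuning of the multilingual model.
print(code_switch("book a cheap flight to Berlin"))
```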
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
- GLUECoS: An Evaluation Benchmark for Code-Switched NLP [17.066725832825423]
We present an evaluation benchmark, GLUECoS, for code-switched languages.
We present results on several NLP tasks in English-Hindi and English-Spanish.
We fine-tune multilingual models on artificially generated code-switched data.
arXiv Detail & Related papers (2020-04-26T13:28:34Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% while using less than one-fifth of the training parameters of other word-embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)