XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
- URL: http://arxiv.org/abs/2209.11252v1
- Date: Thu, 22 Sep 2022 18:01:27 GMT
- Title: XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
- Authors: Shivprasad Sagare, Tushar Abhishek, Bhavyajeet Singh, Anubhav Sharma,
Manish Gupta, Vasudeva Varma
- Abstract summary: We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset.
Our experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to the best results on average across the twelve languages.
- Score: 11.581072296148031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple business scenarios require an automated generation of descriptive
human-readable text from structured input data. Hence, fact-to-text generation
systems have been developed for various downstream tasks like generating soccer
reports, weather and financial reports, medical reports, person biographies,
etc. Unfortunately, previous work on fact-to-text (F2T) generation has focused
primarily on English, mainly due to the high availability of relevant datasets.
Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed
for generation across multiple languages, along with a dataset, XALIGN, for eight
languages. However, there has been no rigorous work on the actual XF2T
generation problem. We extend the XALIGN dataset with annotated data for four more
languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive
study using popular Transformer-based text generation models on our extended
multi-lingual dataset, which we call XALIGNV2. Further, we investigate the
performance of different text generation strategies: multiple variations of
pretraining, fact-aware embeddings and structure-aware input encoding. Our
extensive experiments show that a multi-lingual mT5 model which uses fact-aware
embeddings with structure-aware input encoding leads to the best results on average
across the twelve languages. We make our code, dataset and model publicly
available, and hope that this will help advance further research in this
critical area.
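As a rough illustration of what structure-aware input encoding can look like, the short Python sketch below linearizes English fact triples into a single tagged source string with a target-language prefix, as it might be fed to a multilingual seq2seq model such as mT5. The role tags, prefix format and example facts are assumptions for illustration, not the paper's exact scheme.

```python
# Minimal sketch (assumption, not the paper's exact scheme): linearize English
# fact triples into a structure-aware source string with role tags and a
# target-language prefix, as input for a multilingual seq2seq model like mT5.
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate/relation, object)

def linearize_facts(triples: List[Triple], target_lang: str) -> str:
    """Flatten fact triples into one sequence, marking each element's role
    so the encoder can still recover the underlying structure."""
    parts = [f"generate in {target_lang}:"]  # illustrative language-control prefix
    for subj, pred, obj in triples:
        parts.append(f"<S> {subj} <P> {pred} <O> {obj}")
    return " ".join(parts)

# Hypothetical example facts about a Wikipedia entity.
facts = [
    ("Mohanlal", "occupation", "actor"),
    ("Mohanlal", "date of birth", "21 May 1960"),
]
print(linearize_facts(facts, "ml"))
# -> "generate in ml: <S> Mohanlal <P> occupation <O> actor <S> Mohanlal ..."
```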
Related papers
- Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource
Agglutinative Data-to-Text Generation [9.80836683456026]
We tackle data-to-text for isiXhosa, which is low-resource and agglutinative.
We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG.
We develop an evaluation framework for T2X that measures how accurately generated text describes the data.
arXiv Detail & Related papers (2024-03-12T11:53:27Z) - Cross-lingual Editing in Multilingual Language Models [1.3062731746155414]
This paper introduces the cross-lingual model editing (XME) paradigm, wherein a fact is edited in one language, and the subsequent update propagation is observed across other languages.
The results reveal notable performance limitations of state-of-the-art METs under the XME setting, mainly when the languages involved belong to two distinct script families.
arXiv Detail & Related papers (2024-01-19T06:54:39Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
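As a rough illustration of the retrieve-then-prompt flow, the Python sketch below assumes precomputed sentence embeddings, a fixed cosine-similarity retriever and a hand-written prompt template; XRICL instead learns its retriever, so this is only an approximation of the idea.

```python
# Rough sketch (assumptions throughout): a fixed cosine-similarity retriever
# over precomputed exemplar embeddings plus a hand-written prompt template.
# XRICL learns its retriever; this only illustrates the overall flow.
import numpy as np

def retrieve_exemplars(query_vec, exemplar_vecs, exemplars, k=2):
    """Return the k English (question, SQL) exemplars most similar to the
    non-English query embedding under cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    e = exemplar_vecs / np.linalg.norm(exemplar_vecs, axis=1, keepdims=True)
    top = np.argsort(e @ q)[::-1][:k]
    return [exemplars[i] for i in top]

def build_prompt(retrieved, query, query_translation):
    """Concatenate retrieved exemplars, a translation hint, and the query."""
    lines = [f"Q: {q}\nSQL: {s}" for q, s in retrieved]
    lines.append(f"Q (English translation): {query_translation}")
    lines.append(f"Q: {query}\nSQL:")
    return "\n\n".join(lines)
```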
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - XAlign: Cross-lingual Fact-to-Text Alignment and Generation for
Low-Resource Languages [11.581072296148031]
Multiple critical scenarios (like Wikipedia text generation given English Infoboxes) need automated generation of descriptive text in low resource (LR) languages from English fact triples.
To the best of our knowledge, there has been no previous attempt on cross-lingual alignment or generation for LR languages.
We propose two unsupervised methods for cross-lingual alignment.
arXiv Detail & Related papers (2022-02-01T09:41:59Z) - MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first publicly available multilingual FAQ dataset.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
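For context, a DPR-style bi-encoder reduces FAQ retrieval to a dot product between independently encoded question and answer embeddings; the toy Python sketch below assumes those embeddings have already been produced, with random vectors standing in for the two encoders.

```python
# Toy illustration (assumed setup): DPR-style bi-encoder retrieval reduces to
# a dot product between independently encoded question and FAQ-answer
# embeddings. Random vectors stand in for the two encoders' outputs here.
import numpy as np

def score(question_vec: np.ndarray, answer_vecs: np.ndarray) -> np.ndarray:
    """Dot-product relevance of one question against all candidate answers."""
    return answer_vecs @ question_vec

rng = np.random.default_rng(0)
q = rng.normal(size=128)           # question embedding (one encoder)
a = rng.normal(size=(1000, 128))   # 1,000 FAQ answer embeddings (other encoder)
best_idx = int(np.argmax(score(q, a)))
```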
arXiv Detail & Related papers (2021-09-27T08:43:25Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Multilingual AMR-to-Text Generation [22.842874899794996]
We create multilingual AMR-to-text models that generate in twenty-one different languages.
For eighteen languages, based on automatic metrics, our multilingual models surpass baselines that generate into a single language.
We analyse the ability of our multilingual models to accurately capture morphology and word order using human evaluation, and find that native speakers judge our generations to be fluent.
arXiv Detail & Related papers (2020-11-10T22:47:14Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
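The self-teaching loss can be pictured as a KL divergence between the model's distribution on translated target-language text and soft pseudo-labels computed on the source-language text. The PyTorch sketch below is an assumed simplification of that idea, not FILTER's exact training recipe.

```python
# Assumed simplification of a KL-divergence self-teaching loss: the student's
# predictions on target-language (translated) text are pulled toward soft
# pseudo-labels computed on the source-language text. Shapes and the exact
# teacher/student pairing are illustrative, not FILTER's precise recipe.
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over label distributions, averaged per batch."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach(), dim=-1)  # soft pseudo-labels
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Example: batch of 4 examples, 3 classes (e.g., an XNLI-style task).
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
loss = self_teaching_kl_loss(student, teacher)
loss.backward()
```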
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training,
Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.