Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource
Agglutinative Data-to-Text Generation
- URL: http://arxiv.org/abs/2403.07567v1
- Date: Tue, 12 Mar 2024 11:53:27 GMT
- Title: Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource
Agglutinative Data-to-Text Generation
- Authors: Francois Meyer and Jan Buys
- Abstract summary: We tackle data-to-text for isiXhosa, which is low-resource and agglutinative.
We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG.
We develop an evaluation framework for T2X that measures how accurately generated text describes the data.
- Score: 9.80836683456026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most data-to-text datasets are for English, so the difficulties of modelling
data-to-text for low-resource languages are largely unexplored. In this paper
we tackle data-to-text for isiXhosa, which is low-resource and agglutinative.
We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of
WebNLG, which presents a new linguistic context that shifts modelling demands
to subword-driven techniques. We also develop an evaluation framework for T2X
that measures how accurately generated text describes the data. This enables
future users of T2X to go beyond surface-level metrics in evaluation. On the
modelling side we explore two classes of methods - dedicated data-to-text
models trained from scratch and pretrained language models (PLMs). We propose a
new dedicated architecture aimed at agglutinative data-to-text, the Subword
Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy
entities, and outperforms existing dedicated models for 2 agglutinative
languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X,
which reveals that standard PLMs come up short. Fine-tuning machine translation
models emerges as the best method overall. These findings underscore the
distinct challenge presented by T2X: neither well-established data-to-text
architectures nor customary pretrained methodologies prove optimal. We conclude
with a qualitative analysis of generation errors and an ablation study.
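The abstract describes SSPG as a decoder that mixes generating subwords from the vocabulary with copying entity tokens from the input triples. The sketch below is only a rough illustration of that copy/generate idea and of feeding linearised triples to a sequence-to-sequence model; it is not the authors' SSPG implementation (which additionally learns subword segmentation jointly with copying), and all function names, special tokens, and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def linearise_triples(triples):
    """Flatten (subject, predicate, object) triples into one source string,
    a plausible seq2seq input format; the exact T2X format is not shown here."""
    return " ".join(f"<subj> {s} <pred> {p} <obj> {o}" for s, p, o in triples)

def output_distribution(vocab_logits, copy_attention, src_token_ids, p_gen):
    """Pointer-generator style mixture of a generation and a copy distribution.

    vocab_logits:   [batch, vocab_size] decoder scores over the subword vocabulary
    copy_attention: [batch, src_len]    attention weights over source positions
    src_token_ids:  [batch, src_len]    vocabulary ids of the linearised input triples
    p_gen:          [batch, 1]          probability of generating rather than copying
    """
    p_vocab = F.softmax(vocab_logits, dim=-1)              # generate path
    p_copy = torch.zeros_like(p_vocab)
    p_copy.scatter_add_(1, src_token_ids, copy_attention)  # copy path: attention projected onto token ids
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy        # final next-token distribution
```

In the same spirit, the fine-tuned machine translation baseline mentioned in the abstract would treat the linearised triples as the "source language" and the isiXhosa description as the target.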
Related papers
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages [11.581072296148031]
We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset.
Our experiments show that a multi-lingual mT5 model using fact-aware embeddings with structure-aware input encoding yields the best results on average across the twelve languages.
arXiv Detail & Related papers (2022-09-22T18:01:27Z)
- What Makes Data-to-Text Generation Hard for Pretrained Language Models? [17.07349898176898]
Expressing natural language descriptions of structured facts or relations -- data-to-text generation (D2T) -- increases the accessibility of structured knowledge repositories.
Previous work shows that pre-trained language models (PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data.
We conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset.
arXiv Detail & Related papers (2022-05-23T17:58:39Z)
- Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We evaluate its effectiveness under three settings: fully-supervised, zero-shot, and few-shot.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity [3.8673630752805432]
We present DataTuner, a neural, end-to-end data-to-text generation system.
We take a two-stage generation-reranking approach, combining a fine-tuned language model with a semantic fidelity classifier.
We show that DataTuner achieves state-of-the-art results on the automated metrics across four major D2T datasets.
arXiv Detail & Related papers (2020-04-08T11:16:53Z)
- Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling [4.525267347429154]
We train a Transformer-based neural model conditioned on the BERT language model.
In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size.
The results of our models are compared to a baseline and the state-of-the-art models on the CNN/Daily Mail dataset.
arXiv Detail & Related papers (2020-03-29T14:00:17Z)
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)