Exploiting News Article Structure for Automatic Corpus Generation of
Entailment Datasets
- URL: http://arxiv.org/abs/2010.11574v3
- Date: Fri, 13 Aug 2021 09:26:29 GMT
- Title: Exploiting News Article Structure for Automatic Corpus Generation of
Entailment Datasets
- Authors: Jan Christian Blaise Cruz, Jose Kristian Resabal, James Lin, Dan John
Velasco and Charibeth Cheng
- Abstract summary: First, we propose a methodology for automatically producing benchmark datasets for low-resource languages using published news articles.
Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino.
Third, we perform analyses on transfer learning techniques to shed light on their true performance when operating in low-data domains.
- Score: 1.859931123372708
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers represent the state-of-the-art in Natural Language Processing
(NLP) in recent years, proving effective even in tasks done in low-resource
languages. While pretrained transformers for these languages can be made, it is
challenging to measure their true performance and capacity due to the lack of
hard benchmark datasets, as well as the difficulty and cost of producing them.
In this paper, we present three contributions: First, we propose a methodology
for automatically producing Natural Language Inference (NLI) benchmark datasets
for low-resource languages using published news articles. Through this, we
create and release NewsPH-NLI, the first sentence entailment benchmark dataset
in the low-resource Filipino language. Second, we produce new pretrained
transformers based on the ELECTRA technique to further alleviate the resource
scarcity in Filipino, benchmarking them on our dataset against other
commonly-used transfer learning techniques. Lastly, we perform analyses on
transfer learning techniques to shed light on their true performance when
operating in low-data domains through the use of degradation tests.
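As a rough illustration of the dataset-generation idea (not the paper's exact procedure), the sketch below pairs consecutive sentences from the same news article as entailment candidates and sentences drawn from different articles as non-entailment candidates; the function names and the pairing heuristic are assumptions made for illustration only.

```python
# Illustrative sketch (not the paper's exact procedure): build premise-hypothesis
# pairs from news articles, assuming that consecutive sentences within one article
# tend to be entailment-like, while sentences from unrelated articles are not.
import random
from typing import List, Tuple

def make_nli_pairs(articles: List[List[str]], seed: int = 0) -> List[Tuple[str, str, str]]:
    """articles: list of articles, each a list of sentences in reading order."""
    rng = random.Random(seed)
    pairs = []
    for i, sents in enumerate(articles):
        for a, b in zip(sents, sents[1:]):
            # Positive candidate: sentence b directly follows sentence a in the same article.
            pairs.append((a, b, "entailment"))
            # Negative candidate: hypothesis sampled from a different article.
            j = rng.randrange(len(articles))
            if j != i and articles[j]:
                pairs.append((a, rng.choice(articles[j]), "non-entailment"))
    return pairs

if __name__ == "__main__":
    demo = [
        ["The mayor opened the new bridge.", "Traffic across the river is expected to ease."],
        ["The team won the championship.", "Fans celebrated in the streets."],
    ]
    for premise, hypothesis, label in make_nli_pairs(demo):
        print(label, "|", premise, "->", hypothesis)
```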
Related papers
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for
Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data improves performance for monolingual STS.
We find that the Wikipedia domain outperforms the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
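A minimal sketch of the relative-representation idea mentioned in the entry above: each embedding is re-expressed as its cosine similarities to a shared set of anchors, so that a source-language PLM space and target-language static embeddings land in a comparable space. Anchor selection and alignment details below are assumptions, not the paper's released method.

```python
# Hedged sketch of relative representations: embed each vector as its cosine
# similarities to a set of anchors, so two embedding spaces of different dimension
# become comparable. Anchor choice here is an illustrative assumption.
import numpy as np

def relative_representation(vectors: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """vectors: (n, d) embeddings; anchors: (k, d) anchor embeddings from the same space."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return v @ a.T  # (n, k): each row is a vector of cosine similarities to the anchors

# Usage: compute relative representations separately in each space, using anchor rows
# that correspond to translation pairs, then compare rows across languages.
rng = np.random.default_rng(0)
src_space = rng.normal(size=(100, 768))   # e.g., source-language PLM embeddings
tgt_space = rng.normal(size=(100, 300))   # e.g., target-language static embeddings
src_rel = relative_representation(src_space, src_space[:20])
tgt_rel = relative_representation(tgt_space, tgt_space[:20])
print(src_rel.shape, tgt_rel.shape)       # both (100, 20): a common 20-dimensional space
```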
A deep Natural Language Inference predictor without language-specific training data [44.26507854087991]
We present an NLP technique to tackle the problem of natural language inference (NLI) between pairs of sentences in a target language of choice, without a language-specific training dataset.
We exploit a generic translation dataset, manually translated, along with two instances of the same pre-trained model.
The model has been evaluated on the machine-translated Stanford NLI test set, the machine-translated Multi-Genre NLI test set, and the manually translated RTE3-ITA test set.
arXiv Detail & Related papers (2023-09-06T10:20:59Z)
Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the English-Portuguese and Tamasheq-French language pairs, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
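A hedged sketch of the distillation component suggested by the entry above: a teacher trained on a rich-resource language produces soft token-label distributions on unlabeled target-language text, and a student is trained to match them. The reinforcement-learning-based instance selection is omitted, and the loss below is a generic formulation rather than the paper's exact objective.

```python
# Hedged sketch of teacher-student distillation for cross-lingual NER:
# match the student's token-label distribution to the teacher's soft labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Both tensors: (batch, seq_len, num_labels) token-classification logits."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between teacher and student distributions, scaled by t^2.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy check with random logits standing in for model outputs (e.g., 9 BIO labels).
student = torch.randn(4, 16, 9, requires_grad=True)
teacher = torch.randn(4, 16, 9)
loss = distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```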
Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods of BERT -- a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
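A minimal sketch of layer freezing during fine-tuning, assuming the Hugging Face Transformers library; the specific checkpoint and the number of frozen layers are illustrative choices, not the paper's configuration.

```python
# Hedged sketch: freeze the embedding layer and the lower encoder layers of BERT
# to reduce the number of trainable parameters during fine-tuning.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:   # keep only the top 4 of 12 layers trainable
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```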
DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
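A rough sketch of the linearization step described in the entry above: label tags are inserted as ordinary tokens in front of the words they annotate, so a language model can be trained on plain token sequences and later generate new labeled sentences. The exact tag placement and the handling of O tags are assumptions made for illustration.

```python
# Hedged sketch of linearizing/delinearizing labeled sentences for LM-based augmentation.
from typing import List, Tuple

def linearize(tokens: List[str], tags: List[str]) -> List[str]:
    out = []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)      # e.g., "B-LOC" becomes a normal vocabulary item
        out.append(token)
    return out

def delinearize(sequence: List[str]) -> List[Tuple[str, str]]:
    pairs, pending = [], "O"
    for item in sequence:
        if item.startswith(("B-", "I-")):
            pending = item
        else:
            pairs.append((item, pending))
            pending = "O"
    return pairs

sentence = ["John", "lives", "in", "Quezon", "City"]
labels = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
lin = linearize(sentence, labels)
print(" ".join(lin))      # B-PER John lives in B-LOC Quezon I-LOC City
print(delinearize(lin))   # recovers (token, tag) pairs
```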
TransQuest: Translation Quality Estimation with Cross-lingual Transformers [14.403165053223395]
We propose a simple QE framework based on cross-lingual transformers.
We use it to implement and evaluate two different neural architectures.
Our evaluation shows that the proposed methods achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-01T16:34:44Z)
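A hedged sketch of the cross-lingual QE setup suggested above: the source sentence and its translation are encoded jointly by a multilingual transformer, and a single-output regression head predicts a quality score. The checkpoint and head shown are illustrative stand-ins, not the released TransQuest architecture.

```python
# Hedged sketch: joint encoding of (source, translation) with a regression head.
# The model would still need to be fine-tuned on (source, translation, score) triples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)  # regression head

source = "The cat sat on the mat."
translation = "Le chat s'est assis sur le tapis."
inputs = tokenizer(source, translation, return_tensors="pt", truncation=True)

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # untrained output, shown for shape only
print(score)
```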
Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
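A generic sketch of one selection-and-weighting round for back-translated data, loosely following the idea above: synthetic pairs are scored, the top fraction is kept, and the scores double as sample weights. The scoring function below is a hypothetical stand-in, not the paper's criterion.

```python
# Hedged sketch of data selection and weighting for synthetic back-translated pairs.
from typing import Callable, List, Tuple

def select_and_weight(pairs: List[Tuple[str, str]],
                      score_fn: Callable[[str, str], float],
                      keep_fraction: float = 0.5) -> List[Tuple[str, str, float]]:
    scored = sorted(((score_fn(s, t), s, t) for s, t in pairs), reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return [(s, t, score) for score, s, t in scored[:keep]]

# Stand-in scorer: prefer pairs with similar lengths. A real system might use an
# in-domain language model or round-trip likelihood instead.
def length_ratio_score(src: str, tgt: str) -> float:
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b)

synthetic = [("a short source", "une source courte"),
             ("another synthetic source sentence", "phrase")]
for src, tgt, weight in select_and_weight(synthetic, length_ratio_score):
    print(f"{weight:.2f}  {src} ||| {tgt}")
```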
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
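To make the text-to-text framing in the last entry concrete, the sketch below casts two different tasks as plain string-to-string problems distinguished only by a task prefix, using a small public T5 checkpoint purely for illustration.

```python
# Hedged sketch of the text-to-text idea: every task becomes "input text -> output text",
# distinguished only by a task prefix (prefixes follow common T5 conventions).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Transformers have become the dominant architecture in NLP, "
    "replacing recurrent models in most benchmarks.",
]
for text in examples:
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```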