Mind Your Inflections! Improving NLP for Non-Standard Englishes with
Base-Inflection Encoding
- URL: http://arxiv.org/abs/2004.14870v4
- Date: Wed, 18 Nov 2020 06:16:31 GMT
- Authors: Samson Tan, Shafiq Joty, Lav R. Varshney, Min-Yen Kan
- Abstract summary: Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English.
We propose Base-Inflection Encoding (BITE) to tokenize English text by reducing inflected words to their base forms and reinjecting the grammatical information as special symbols.
We show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers.
- Score: 44.36
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inflectional variation is a common feature of World Englishes such as
Colloquial Singapore English and African American Vernacular English. Although
comprehension by human readers is usually unimpaired by non-standard
inflections, current NLP systems are not yet robust. We propose Base-Inflection
Encoding (BITE), a method to tokenize English text by reducing inflected words
to their base forms before reinjecting the grammatical information as special
symbols. Fine-tuning pretrained NLP models for downstream tasks using our
encoding defends against inflectional adversaries while maintaining performance
on clean data. Models using BITE generalize better to dialects with
non-standard inflections without explicit training, and translation models
converge faster when trained with BITE. Finally, we show that our encoding
improves the vocabulary efficiency of popular data-driven subword tokenizers.
Since there has been no prior work on quantitatively evaluating vocabulary
efficiency, we propose metrics to do so.
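As a rough illustration of the encoding, the sketch below reduces each word to its base form with NLTK's WordNet lemmatizer and reinjects the grammatical information as a special symbol derived from the Penn Treebank tag. The tagger, the tag-to-symbol mapping, and the symbol format here are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal BITE-style encoder sketch (assumes NLTK; illustrative only).
import nltk
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (resource names can vary across NLTK versions).
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

_lemmatizer = WordNetLemmatizer()
# Map the first letter of a Penn Treebank tag to a WordNet POS class.
_PTB_TO_WN = {"V": "v", "N": "n", "J": "a", "R": "r"}

def bite_encode(sentence: str) -> list[str]:
    """Reduce each inflected word to its base form, then reinject the
    grammatical information as a special inflection symbol."""
    encoded = []
    for word, tag in nltk.pos_tag(sentence.split()):
        base = _lemmatizer.lemmatize(word.lower(), pos=_PTB_TO_WN.get(tag[0], "n"))
        encoded.append(base)
        if base != word.lower():
            encoded.append(f"<{tag}>")  # e.g. <VBD> marks past tense
    return encoded

print(bite_encode("She has two cats and walked them yesterday"))
# e.g. ['she', 'have', '<VBZ>', 'two', 'cat', '<NNS>', 'and', 'walk', '<VBD>',
#       'them', 'yesterday']
```

Because the base form and the inflection symbol are separate tokens, a non-standard inflection changes only the symbol, leaving the base token intact for the downstream model.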
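The vocabulary-efficiency metrics themselves are defined in the paper; as one common proxy for the same intuition, the sketch below (assuming the Hugging Face transformers package and the bert-base-uncased tokenizer) measures subword fertility, the average number of subword tokens per word. Fewer splits per word suggests the vocabulary covers the text more efficiently.

```python
# Subword fertility as a rough vocabulary-efficiency proxy (assumes the
# Hugging Face transformers package; not the paper's proposed metrics).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def fertility(sentences: list[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = sum(len(s.split()) for s in sentences)
    n_subwords = sum(len(tok.tokenize(s)) for s in sentences)
    return n_subwords / n_words

# Rare inflected forms tend to split into more subwords than base forms,
# so encoding base forms should lower fertility on the same text.
print(fertility(["the reviewers nitpicked relentlessly"]))
print(fertility(["the reviewer nitpick relentless"]))
```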
Related papers
- Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are largely insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the models' sensitivity to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z)
- Dialect-robust Evaluation of Generated Text [40.85375247260744]
We formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics.
Applying our evaluation suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust.
arXiv Detail & Related papers (2022-11-02T07:12:23Z)
- Sentence Representation Learning with Generative Objective rather than Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative objective achieves a substantial performance improvement and outperforms current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z)
- Order-sensitive Shapley Values for Evaluating Conceptual Soundness of NLP Models [13.787554178089444]
Order-sensitive Shapley Values (OSV) is an explanation method for sequential data.
We show that OSV is more faithful in explaining model behavior than gradient-based methods.
We also show that OSV can be leveraged to generate adversarial examples.
arXiv Detail & Related papers (2022-06-01T02:30:12Z)
- Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language-modeling objective with an unlikelihood objective based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource settings.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that using NS annotators produces results consistently on par with or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- It's Morphin' Time! Combating Linguistic Discrimination with Inflectional Perturbations [68.16751625956243]
Training on only perfect Standard English corpora predisposes neural networks to discriminate against minorities from non-standard linguistic backgrounds.
We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples (see the sketch after this list).
arXiv Detail & Related papers (2020-05-09T04:01:43Z)
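For the last entry above, the companion paper to BITE, a minimal sketch of inflectional perturbation might look as follows, assuming the third-party lemminflect package; this covers only the variant-enumeration step, not the paper's full adversarial search.

```python
# Enumerate inflectional variants of a word (assumes lemminflect; the
# paper's method additionally searches these variants adversarially).
from lemminflect import getAllInflections, getLemma

def inflectional_variants(word: str, upos: str = "VERB") -> set[str]:
    """Return all inflected surface forms that share this word's lemma."""
    lemmas = getLemma(word, upos=upos)
    if not lemmas:
        return set()
    variants: set[str] = set()
    for forms in getAllInflections(lemmas[0], upos=upos).values():
        variants.update(forms)
    variants.discard(word)
    return variants

# Swapping in any variant yields a plausible, semantically similar
# perturbation, e.g. "walked" -> "walk", "walks", or "walking".
print(inflectional_variants("walked"))
```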
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.