To Augment or Not to Augment? A Comparative Study on Text Augmentation
Techniques for Low-Resource NLP
- URL: http://arxiv.org/abs/2111.09618v1
- Date: Thu, 18 Nov 2021 10:52:48 GMT
- Title: To Augment or Not to Augment? A Comparative Study on Text Augmentation
Techniques for Low-Resource NLP
- Authors: Gözde Gül Şahin
- Abstract summary: We investigate three categories of text augmentation methodologies, which perform changes at the syntax, token, and character levels.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Data-hungry deep neural networks have established themselves as the standard
for many NLP tasks, including the traditional sequence tagging ones. Despite
their state-of-the-art performance on high-resource languages, they still fall
behind their statistical counterparts in low-resource scenarios. One
methodology to counter this problem is text augmentation, i.e.,
generating new synthetic training data points from existing data. Although NLP
has recently witnessed a surge of textual augmentation techniques, the field
still lacks a systematic performance analysis on a diverse set of languages and
sequence tagging tasks. To fill this gap, we investigate three categories of
text augmentation methodologies, which perform changes at the syntax (e.g.,
cropping sub-sentences), token (e.g., random word insertion), and character
(e.g., character swapping) levels. We systematically compare them on
part-of-speech tagging, dependency parsing, and semantic role labeling for a
diverse set of language families, using various models, including
architectures that rely on pretrained multilingual contextualized language
models such as mBERT. Augmentation most significantly improves dependency
parsing, followed by part-of-speech tagging and semantic role labeling. We find
the experimented techniques to be effective on morphologically rich languages
in general, but less so on analytic languages such as Vietnamese. Our results
suggest that the augmentation techniques can further improve over strong
baselines based on mBERT. We identify the character-level methods as the most
consistent performers, while synonym replacement and syntactic augmenters
provide inconsistent improvements. Finally, we discuss how the results depend
most heavily on the task, language pair, and model type.
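To make the three categories concrete, here is a minimal sketch of one operation from each level, assuming whitespace-tokenized input and, for the syntax-level cropping, a precomputed dependency head for each token. The function names and simplifications are ours, not the paper's implementation (the paper's cropping operates on full dependency trees with label-preserving rules).

```python
import random

rng = random.Random(0)  # fixed seed so the augmentations are reproducible

def char_swap(token):
    """Character-level: swap two adjacent inner characters of a token."""
    if len(token) < 4:                      # leave short tokens untouched
        return token
    i = rng.randrange(1, len(token) - 2)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def random_insertion(tokens, vocab):
    """Token-level: insert a random vocabulary word at a random position."""
    out = list(tokens)
    out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out

def crop(tokens, heads, keep_idx):
    """Syntax-level: keep the root plus the dependency subtree of one
    argument (keep_idx), dropping the rest of the sentence. `heads` gives
    a 0-based head index per token; the root points to itself."""
    def in_subtree(i):
        seen = set()
        while i not in seen:                # walk up the head chain
            if i == keep_idx:
                return True
            seen.add(i)
            i = heads[i]
        return False
    root = next(i for i, h in enumerate(heads) if h == i)
    return [t for i, t in enumerate(tokens) if i == root or in_subtree(i)]

tokens = ["She", "wrote", "a", "long", "letter"]
heads = [1, 1, 4, 4, 1]                     # "wrote" is the root
print([char_swap(t) for t in tokens])
print(random_insertion(tokens, vocab=["quickly", "today"]))
print(crop(tokens, heads, keep_idx=0))      # -> ['She', 'wrote']
```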
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Distributional Data Augmentation Methods for Low Resource Language [0.9208007322096533]
Easy data augmentation (EDA) augments the training data by injecting and replacing synonyms and randomly permuting sentences.
One major obstacle with EDA is the need for versatile and complete synonym dictionaries, which cannot be easily found in low-resource languages.
We propose two extensions, easy distributional data augmentation (EDDA) and type-specific similar word replacement (TSSR), which use semantic word context information and part-of-speech tags for word replacement and augmentation.
arXiv Detail & Related papers (2023-09-09T19:01:59Z)
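As a rough illustration of the EDA operations described in this entry, the sketch below implements synonym replacement from a toy dictionary and random sentence permutation; the tiny SYNONYMS table stands in for the broad synonym dictionaries that the summary notes are hard to obtain for low-resource languages, and all names here are illustrative assumptions rather than the authors' code.

```python
import random

rng = random.Random(0)

# Toy stand-in for a synonym dictionary (e.g., WordNet); exactly the kind
# of resource that is scarce for low-resource languages.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replace(tokens, n=1):
    """Replace up to n tokens that have an entry in the synonym table."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def permute_sentences(sentences):
    """Randomly reorder the sentences of a training document."""
    out = list(sentences)
    rng.shuffle(out)
    return out

print(synonym_replace(["a", "quick", "happy", "fox"], n=2))
print(permute_sentences(["First sentence.", "Second one.", "Third."]))
```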
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
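For context, here is a minimal translate-and-test sketch with Hugging Face pipelines: translate the target-language input into English, then classify the translation with an English-only model. The two model checkpoints are illustrative assumptions, not the ones used in T3L.

```python
from transformers import pipeline

# Stage 1: machine-translate the target-language text into English.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
# Stage 2: classify the English translation with a monolingual model.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

def translate_and_test(text):
    english = translator(text)[0]["translation_text"]
    return classifier(english)[0]

print(translate_and_test("Der Film war wirklich großartig."))
# e.g. {'label': 'POSITIVE', 'score': ...}
```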
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Selective Text Augmentation with Word Roles for Low-Resource Text Classification [3.4806267677524896]
Different words may play different roles in text classification, which inspires us to strategically select the proper roles for text augmentation.
In this work, we first identify the relationships between the words in a text and the text category from the perspectives of statistical correlation and semantic similarity.
We present a new augmentation technique called STA (Selective Text Augmentation) where different text-editing operations are selectively applied to words with specific roles.
arXiv Detail & Related papers (2022-09-04T08:13:11Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text Classification [11.742065170002162]
We present data augmentation using lexicalized probabilistic context-free grammars (ALP).
Experiments on few-shot text classification tasks demonstrate that ALP enhances many state-of-the-art classification methods.
We argue empirically that the traditional splitting of training and validation sets is sub-optimal compared to our novel augmentation-based splitting strategies.
arXiv Detail & Related papers (2021-12-16T09:56:35Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)