Multi-Task Text Classification using Graph Convolutional Networks for
Large-Scale Low Resource Language
- URL: http://arxiv.org/abs/2205.01204v1
- Date: Mon, 2 May 2022 20:44:12 GMT
- Title: Multi-Task Text Classification using Graph Convolutional Networks for
Large-Scale Low Resource Language
- Authors: Mounika Marreddy, Subba Reddy Oota, Lakshmi Sireesha Vakada, Venkata
Charan Chinni, Radhika Mamidi
- Abstract summary: Graph Convolutional Networks (GCN) have achieved state-of-the-art results on single text classification tasks.
Applying GCN for multi-task text classification is an unexplored area.
We study the use of GCN for the Telugu language in single and multi-task settings for four natural language processing (NLP) tasks.
- Score: 5.197307534263253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graph Convolutional Networks (GCN) have achieved state-of-the-art
results on single text classification tasks such as sentiment analysis and
emotion detection. However, this performance is achieved by testing and
reporting on resource-rich languages like English. Applying GCN to multi-task
text classification is an unexplored area. Moreover, training a GCN or
adapting an English GCN to Indian languages is often limited by data
availability, rich morphological variation, and differences in syntax and
semantics. In this paper, we study the use of GCN for the Telugu language in
single and multi-task settings for four natural language processing (NLP)
tasks, viz. sentiment analysis (SA), emotion identification (EI), hate-speech
detection (HS), and sarcasm detection (SAR). To evaluate the performance of
GCN on an Indian language, Telugu, we analyze GCN-based models with extensive
experiments on the four downstream tasks. In addition, we created an annotated
Telugu dataset, TEL-NLP, for the four NLP tasks. Further, we propose a
supervised graph reconstruction method, Multi-Task Text GCN (MT-Text GCN),
for Telugu that simultaneously (i) learns low-dimensional word and sentence
graph embeddings from word-sentence graph reconstruction using a graph
autoencoder (GAE) and (ii) performs multi-task text classification using these
latent sentence graph embeddings. Our proposed MT-Text GCN achieves
significant improvements on TEL-NLP over existing Telugu pretrained word
embeddings and over the multilingual pretrained Transformer models mBERT and
XLM-R. On TEL-NLP, we achieve high F1-scores on the four NLP tasks: SA (0.84),
EI (0.55), HS (0.83), and SAR (0.66). Finally, we present quantitative and
qualitative analyses of our model on the four NLP tasks in Telugu.
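To make the two-step recipe in the abstract concrete, here is a minimal
PyTorch sketch of a GAE-style reconstruction of a word-sentence graph combined
with per-task classification heads on the latent sentence embeddings. This is
an illustration under assumptions, not the authors' released implementation:
the layer sizes, the inner-product decoder, the equal loss weighting, and all
class names are ours.

```python
# Illustrative sketch only: a graph autoencoder (GAE) over a
# word-sentence graph, plus one classification head per task on the
# latent sentence embeddings. Dimensions, loss weighting, and class
# names are assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution: H' = A_hat @ H @ W (A_hat pre-normalized)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat, h):
        return self.lin(a_hat @ h)

class MTTextGCNSketch(nn.Module):
    def __init__(self, n_nodes, hid_dim, lat_dim, task_classes):
        super().__init__()
        self.gc1 = GCNLayer(n_nodes, hid_dim)   # one-hot inputs -> hidden
        self.gc2 = GCNLayer(hid_dim, lat_dim)   # hidden -> latent embeddings
        # One lightweight head per task, e.g. {"SA": 2, "EI": 4, ...}.
        self.heads = nn.ModuleDict(
            {t: nn.Linear(lat_dim, c) for t, c in task_classes.items()}
        )

    def forward(self, a_hat, x, sent_idx):
        z = self.gc2(a_hat, F.relu(self.gc1(a_hat, x)))  # latent node embeddings
        a_rec = torch.sigmoid(z @ z.t())                 # inner-product decoder
        logits = {t: h(z[sent_idx]) for t, h in self.heads.items()}
        return a_rec, logits

def joint_loss(a_rec, a_true, logits, labels):
    """Graph reconstruction loss plus summed task cross-entropies."""
    loss = F.binary_cross_entropy(a_rec, a_true)
    for task, y in labels.items():
        loss = loss + F.cross_entropy(logits[task], y)
    return loss
```

Here a_hat is assumed to be the symmetrically normalized adjacency
D^{-1/2}(A + I)D^{-1/2} over word and sentence nodes, with one-hot node
features; the four heads would correspond to SA, EI, HS, and SAR.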
Related papers
- Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages [0.4499833362998489]
Chain of Translation Prompting (CoTR) is a novel strategy designed to enhance the performance of language models in low-resource languages.
CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English.
We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi; a minimal sketch of the pattern appears below.
arXiv Detail & Related papers (2024-09-06T17:15:17Z)
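The description above amounts to a simple two-call pipeline: translate first,
then run the task on the translation. The sketch below shows that pattern;
call_model is a hypothetical placeholder for whatever LLM client is in use,
and the prompt wording is an assumption rather than the paper's exact prompts.

```python
# Hypothetical sketch of Chain-of-Translation prompting (CoTR): first
# translate the low-resource input into English, then run the task on
# the translation. `call_model` is a placeholder for any LLM client,
# not a real API; the prompt wording is assumed, not taken from the paper.
from typing import Callable

def cotr(text: str, task_instruction: str,
         call_model: Callable[[str], str]) -> str:
    # Step 1: translate the low-resource input (e.g. Marathi) to English.
    english = call_model(
        "Translate the following Marathi text to English:\n" + text
    )
    # Step 2: perform the downstream task on the English translation.
    return call_model(task_instruction + "\nText: " + english)

# Usage, with any chat-completion function bound to call_model:
# label = cotr(marathi_sentence,
#              "Classify the sentiment as positive, negative, or neutral.",
#              call_model)
```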
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages [6.8708103492634836]
Hundreds of underserved languages have available data sources in the form of interlinear glossed text (IGT) from language documentation efforts.
We make the case that IGT data can be leveraged successfully provided that target language expertise is available.
We illustrate each step through a case study on developing a morphological reinflection system for the Tsimshianic language Gitksan.
arXiv Detail & Related papers (2022-03-17T22:02:25Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies that operate on syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Graph Neural Networks for Natural Language Processing: A Survey [64.36633422999905]
We present a comprehensive overview of Graph Neural Networks (GNNs) for Natural Language Processing.
We propose a new taxonomy of GNNs for NLP, which organizes existing research along three axes: graph construction, graph representation learning, and graph-based encoder-decoder models.
arXiv Detail & Related papers (2021-06-10T23:59:26Z)
- Graph Convolutional Network for Swahili News Classification [78.6363825307044]
This work empirically demonstrates that a Text Graph Convolutional Network (Text GCN) outperforms traditional natural language processing baselines on semi-supervised Swahili news classification; the graph construction Text GCN relies on is sketched below.
arXiv Detail & Related papers (2021-03-16T21:03:47Z)
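For readers unfamiliar with Text GCN, the standard construction (from Yao et
al.'s original Text GCN, which the Swahili paper applies) builds one graph
over document and word nodes, weighting document-word edges by TF-IDF and
word-word edges by positive PMI over sliding windows. A simplified sketch,
with the window size and whitespace tokenization as assumptions:

```python
# Sketch of the standard Text GCN graph construction (Yao et al., 2019):
# document-word edges weighted by TF-IDF, word-word edges by positive
# PMI over sliding windows. Window size and whitespace tokenization are
# simplifying assumptions.
import math
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def build_text_gcn_adjacency(docs, window=20):
    tfidf = TfidfVectorizer(token_pattern=r"\S+")
    dw = tfidf.fit_transform(docs).toarray()      # (n_docs, n_words) TF-IDF
    n_docs, n_words = dw.shape
    vocab = tfidf.vocabulary_

    # Count word and word-pair occurrences over sliding windows.
    n_windows, word_count, pair_count = 0, Counter(), Counter()
    for doc in docs:
        ids = [vocab[t] for t in doc.lower().split() if t in vocab]
        for start in range(max(1, len(ids) - window + 1)):
            n_windows += 1
            uniq = set(ids[start:start + window])
            word_count.update(uniq)
            pair_count.update(combinations(sorted(uniq), 2))

    n = n_docs + n_words
    a = np.eye(n)                                  # self-loops
    a[:n_docs, n_docs:] = dw                       # doc-word edges
    a[n_docs:, :n_docs] = dw.T
    for (i, j), c in pair_count.items():           # word-word edges
        pmi = math.log(c * n_windows / (word_count[i] * word_count[j]))
        if pmi > 0:
            a[n_docs + i, n_docs + j] = a[n_docs + j, n_docs + i] = pmi

    # Symmetric normalization: D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```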
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts a multi-task framework with a shared pre-trained model, which has the advantage of capturing shared knowledge across relevant Chinese tasks; a shared-encoder sketch follows below.
arXiv Detail & Related papers (2020-09-24T11:45:39Z)
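The shared-model idea N-LTP describes, one pretrained encoder feeding several
task-specific heads, can be sketched in a few lines; the model name and head
layout below are illustrative assumptions, not N-LTP's actual architecture.

```python
# Hedged sketch of a shared-encoder multi-task setup in the spirit of
# N-LTP: one pretrained encoder, one small head per task. Model name
# and head layout are illustrative, not N-LTP's actual code.
import torch.nn as nn
from transformers import AutoModel

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, task_classes, model_name="bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared across tasks
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict(
            {t: nn.Linear(hidden, c) for t, c in task_classes.items()}
        )

    def forward(self, task, **enc_inputs):
        # Every task reuses the same encoder weights; only the head differs.
        states = self.encoder(**enc_inputs).last_hidden_state
        return self.heads[task](states)  # per-token logits (tagging-style tasks)
```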
- IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding [41.691861010118394]
We introduce the first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding tasks.
IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with different levels of complexity.
The datasets for the tasks lie in different domains and styles to ensure task diversity.
We also provide a set of Indonesian pre-trained models (IndoBERT) trained on a large and clean Indonesian dataset, Indo4B.
arXiv Detail & Related papers (2020-09-11T12:21:41Z)