Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers
- URL: http://arxiv.org/abs/2212.00587v1
- Date: Thu, 1 Dec 2022 15:24:19 GMT
- Title: Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers
- Authors: Frederico Dias Souza and João Baptista de Oliveira e Souza Filho
- Abstract summary: This study covers models ranging from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP approaches.
It aims to provide a comprehensive experimental study of embedding approaches targeting a binary sentiment classification of user reviews in Brazilian Portuguese.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text classification is a natural language processing (NLP) task relevant to
many commercial applications, such as e-commerce and customer service. Classifying
such excerpts accurately is often challenging due to intrinsic language aspects,
such as irony and nuance. To accomplish this task, one
must provide a robust numerical representation for documents, a process known
as embedding. Embedding represents a key NLP field nowadays, having faced a
significant advance in the last decade, especially after the introduction of
the word-to-vector concept and the popularization of Deep Learning models for
solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent
Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite
the impressive achievements in this field, the literature coverage regarding
generating embeddings for Brazilian Portuguese texts is scarce, especially when
considering commercial user reviews. Therefore, this work aims to provide a
comprehensive experimental study of embedding approaches targeting a binary
sentiment classification of user reviews in Brazilian Portuguese. This study
covers models ranging from classical (Bag-of-Words) to state-of-the-art
(Transformer-based) NLP approaches. The methods are evaluated with five open-source databases with
pre-defined data partitions made available in an open digital repository to
encourage reproducibility. The fine-tuned TLMs achieved the best results in
all cases, followed by the feature-based TLM, LSTM, and CNN, whose relative
rankings varied with the database under analysis.
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that, although the proposed framework is competitive with weak baselines on MoSECroT, it fails to match some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- BERT for Sentiment Analysis: Pre-trained and Fine-Tuned Alternatives [0.0]
BERT has revolutionized the NLP field by enabling transfer learning with large language models.
This article studies how to better cope with the different embeddings provided by the BERT output layer and the usage of language-specific instead of multilingual models.
arXiv Detail & Related papers (2022-01-10T15:05:05Z)
- LaoPLM: Pre-trained Language Models for Lao [3.2146309563776416]
Pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations.
Although PLMs have been widely used in most NLP applications, they are under-represented in Lao NLP research.
We construct a text classification dataset to alleviate the resource-scarce situation of the Lao language.
We present the first transformer-based PLMs for Lao in four versions: BERT-small, BERT-base, ELECTRA-small, and ELECTRA-base, and evaluate them on two downstream tasks: part-of-speech tagging and text classification.
arXiv Detail & Related papers (2021-10-12T11:13:07Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from high-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Deep Learning for Hindi Text Classification: A Comparison [6.8629257716723]
Research on classifying the morphologically rich, low-resource Hindi language written in Devanagari script has been limited by the absence of a large labeled corpus.
In this work, we use translated versions of English datasets to evaluate models based on CNNs, LSTMs, and attention.
The paper also serves as a tutorial for popular text classification techniques.
arXiv Detail & Related papers (2020-01-19T09:29:12Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)