Comparative Analysis of Multilingual Text Classification &
Identification through Deep Learning and Embedding Visualization
- URL: http://arxiv.org/abs/2312.03789v1
- Date: Wed, 6 Dec 2023 12:03:27 GMT
- Authors: Arinjay Wyawhare
- Abstract summary: The study employs LangDetect, LangId, FastText, and Sentence Transformer on a dataset encompassing 17 languages.
The FastText multi-layer perceptron model achieved remarkable accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This research conducts a comparative study on multilingual text
classification methods, utilizing deep learning and embedding visualization.
The study employs LangDetect, LangId, FastText, and Sentence Transformer on a
dataset encompassing 17 languages. It explores dimensionality's impact on
clustering, revealing FastText's clearer clustering in 2D visualization due to
its extensive multilingual corpus training. Notably, the FastText multi-layer
perceptron model achieved remarkable accuracy, precision, recall, and F1 score,
outperforming the Sentence Transformer model. The study underscores the
effectiveness of these techniques in multilingual text classification,
emphasizing the importance of large multilingual corpora for training
embeddings. It lays the groundwork for future research and assists
practitioners in developing language detection and classification systems.
Additionally, it includes the comparison of multi-layer perceptron, LSTM, and
Convolution models for classification.
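
The embed-then-classify pipeline and the four reported metrics can be illustrated with a minimal, standard-library-only sketch. This is not the paper's implementation: hashed character trigrams stand in for the FastText or Sentence Transformer embeddings, and a nearest-centroid rule stands in for the multi-layer perceptron. The names `DIM`, `embed`, `fit_centroids`, `predict`, and `macro_scores` are hypothetical, as is the toy training data.

```python
import math
from collections import Counter

DIM = 512  # assumed size of the hashed n-gram vector (not from the paper)


def _stable_hash(s):
    # FNV-1a; Python's built-in hash() is salted per process, so avoid it.
    h = 2166136261
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h


def embed(text, n=3, dim=DIM):
    """Map text to an L2-normalised bag of hashed character n-grams."""
    padded = f" {text.lower()} "
    vec = [0.0] * dim
    for i in range(len(padded) - n + 1):
        vec[_stable_hash(padded[i:i + n]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_centroids(samples):
    """Average the embeddings of each language's training sentences."""
    sums, counts = {}, Counter()
    for text, lang in samples:
        acc = sums.setdefault(lang, [0.0] * DIM)
        for j, v in enumerate(embed(text)):
            acc[j] += v
        counts[lang] += 1
    return {lang: [v / counts[lang] for v in acc] for lang, acc in sums.items()}


def predict(text, centroids):
    """Return the language whose centroid has the highest dot product."""
    e = embed(text)
    return max(centroids,
               key=lambda lang: sum(a * b for a, b in zip(e, centroids[lang])))


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    k = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Toy usage with made-up sentences; real experiments need far more data.
train = [("the cat sleeps on the sofa", "en"),
         ("die katze schläft auf dem sofa", "de")]
centroids = fit_centroids(train)
print(predict("the dog sleeps", centroids))
```

Swapping `embed` for a pretrained embedding model and `predict` for a trained classifier would bring the sketch closer to the setup the abstract describes; `macro_scores` applies unchanged to any list of true and predicted labels.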
Related papers
- Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification [4.498100922387482]
Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient.
Previous results demonstrated that these methods can even improve performance on some classification tasks.
This paper investigates how these techniques influence the classification performance and computation costs compared to full fine-tuning.
(2023-08-14T17:12:43Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
(2023-06-08T07:33:22Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
(2022-12-21T02:41:40Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work presents what is, to our knowledge, the first systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks.
(2022-04-19T16:23:07Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
(2021-12-21T08:10:27Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
(2021-11-18T10:52:48Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
(2021-09-13T21:05:37Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
(2021-05-24T12:45:42Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
(2020-04-30T16:25:39Z)
- From text saliency to linguistic objects: learning linguistic interpretable markers with a multi-channels convolutional architecture [2.064612766965483]
We propose a novel approach to inspect the hidden layers of a fitted CNN in order to extract interpretable linguistic objects from texts by exploiting the classification process.
We empirically demonstrate the efficiency of our approach on corpora from two different languages: English and French.
(2020-04-07T10:46:58Z)