Feature Selection on Noisy Twitter Short Text Messages for Language Identification
- URL: http://arxiv.org/abs/2007.05727v1
- Date: Sat, 11 Jul 2020 09:22:01 GMT
- Title: Feature Selection on Noisy Twitter Short Text Messages for Language Identification
- Authors: Mohd Zeeshan Ansari, Tanvir Ahmad and Ana Fatima
- Abstract summary: We apply different feature selection algorithms across various learning algorithms in order to analyze the effect of the algorithm.
The methodology focuses on the word level language identification using a novel dataset of 6903 tweets extracted from Twitter.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of written language identification typically involves detecting the languages present in a sample of text. Moreover, a sequence of text may not belong to a single language but may instead be a mixture of text written in multiple languages. Such text is generated in large volumes on social media platforms due to their flexible and user-friendly environment. It contains a very large number of features, which are essential for the development of statistical, probabilistic, and other kinds of language models. This large feature set comprises relevant as well as irrelevant and redundant features, which have diverse effects on the performance of the learning model. Therefore, feature selection methods are significant for choosing the features that are most relevant to an efficient model. In this article, we consider the Hindi-English language identification task, as Hindi and English are two of the most widely spoken languages of India. We apply different feature selection algorithms across various learning algorithms in order to analyze the effect of the algorithm as well as the number of features on the performance of the task. The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles are examined with different feature selection algorithms over many classifiers. Finally, an exhaustive comparative analysis is presented with respect to the overall experiments conducted for the task.
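The pipeline described in the abstract (n-gram profiles, a feature selection step, then a classifier) can be sketched with scikit-learn. This is a minimal illustration only: the word list below is invented for demonstration, not the paper's 6903-tweet dataset, and the specific combination of chi-squared selection with logistic regression is just one instance of the selector/learner pairs the paper compares, not its exact setup.

```python
# Sketch: word-level Hindi-English language identification using
# character n-gram features, chi-squared feature selection, and a
# linear classifier. Data below is hypothetical, for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative sample of romanized-Hindi vs. English tokens
words = ["nahi", "kyun", "acha", "hai", "mera", "tum",
         "hello", "please", "thanks", "where", "going", "today"]
labels = ["hi", "hi", "hi", "hi", "hi", "hi",
          "en", "en", "en", "en", "en", "en"]

pipe = Pipeline([
    # Character 1- to 3-gram profiles of each word
    ("ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
    # Keep the 40 n-grams most associated with the labels (chi-squared)
    ("select", SelectKBest(chi2, k=40)),
    # One learner out of the several the paper evaluates
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(words, labels)
print(pipe.predict(["kaise", "morning"]))
```

Varying `k` in `SelectKBest` and swapping the final estimator reproduces, in miniature, the abstract's study of how the selection algorithm and the number of features affect task performance.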
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets [1.1647644386277962]
Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP.
We propose assessing linguistic diversity of a data set against a reference language sample.
arXiv Detail & Related papers (2024-03-06T18:14:22Z)
- GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role for language grouping.
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- On Decoding Strategies for Neural Text Generators [73.48162198041884]
We study the interaction between language generation tasks and decoding strategies.
We measure changes in attributes of generated text as a function of both decoding strategy and task.
Our results reveal both previously-observed and surprising findings.
arXiv Detail & Related papers (2022-03-29T16:25:30Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Analysing The Impact Of Linguistic Features On Cross-Lingual Transfer [3.299672391663527]
We analyze a state-of-the-art multilingual model and try to determine what impacts good transfer between languages.
We show that looking at particular syntactic features is 2-4 times more helpful in predicting the performance than an aggregated syntactic similarity.
arXiv Detail & Related papers (2021-05-12T21:22:58Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.