On Language Clustering: A Non-parametric Statistical Approach
- URL: http://arxiv.org/abs/2209.06720v1
- Date: Wed, 14 Sep 2022 15:27:41 GMT
- Title: On Language Clustering: A Non-parametric Statistical Approach
- Authors: Anagh Chattopadhyay, Soumya Sankar Ghosh, Samir Karmakar
- Abstract summary: This study presents statistical approaches that may be employed in nonparametric nonhomogeneous data frameworks.
It also examines their application in the field of natural language processing and language clustering.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Any approach aimed at characterizing and quantifying a particular phenomenon
must include the use of robust statistical methodologies for data analysis.
With this in mind, the purpose of this study is to present statistical
approaches that may be employed in nonparametric nonhomogeneous data
frameworks, as well as to examine their application in the field of natural
language processing and language clustering. Furthermore, this paper discusses
the many uses of nonparametric approaches in linguistic data mining and
processing. The data depth idea allows for the centre-outward ordering of
points in any dimension, resulting in a new nonparametric multivariate
statistical analysis that does not require any distributional assumptions. The
concept of hierarchy is used in historical language categorisation and
structuring, and it aims to organise and cluster languages into subfamilies
using the same premise. In this regard, the current study presents a novel
approach to language family structuring based on non-parametric approaches
produced from a typological structure of words in various languages, which is
then converted into a Cartesian framework using multidimensional scaling (MDS). This
statistical-depth-based architecture allows for the use of data-depth-based
methodologies for robust outlier detection, which is extremely useful in
understanding the categorization of diverse borderline languages and allows for
the re-evaluation of existing classification systems. Other depth-based
approaches are also applied to processes such as unsupervised and supervised
clustering. This paper therefore provides an overview of procedures that can be
applied to nonhomogeneous language classification systems in a nonparametric
framework.
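The depth-plus-MDS pipeline outlined in the abstract can be sketched on toy data. Everything below is an illustrative assumption, not the paper's exact setup: a made-up dissimilarity matrix over five hypothetical languages, a two-dimensional embedding, and Mahalanobis depth as one concrete notion of data depth.

```python
# Sketch: embed pairwise language dissimilarities with MDS, then rank the
# embedded points by a data depth and flag the shallowest as borderline.
# The dissimilarity matrix and the choice of Mahalanobis depth are
# illustrative assumptions, not taken from the paper.
import numpy as np
from sklearn.manifold import MDS

# Toy symmetric dissimilarity matrix (e.g., typological distances).
D = np.array([
    [0.00, 0.20, 0.30, 0.90, 0.80],
    [0.20, 0.00, 0.25, 0.85, 0.90],
    [0.30, 0.25, 0.00, 0.80, 0.85],
    [0.90, 0.85, 0.80, 0.00, 0.30],
    [0.80, 0.90, 0.85, 0.30, 0.00],
])

# Step 1: convert dissimilarities into a Cartesian frame via MDS.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X = mds.fit_transform(D)

# Step 2: Mahalanobis depth — deeper points lie closer to the centre.
mu = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X.T))
d2 = np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu)
depth = 1.0 / (1.0 + d2)

# Step 3: the lowest-depth points are candidate borderline languages.
borderline = np.argsort(depth)[:1]
print("depths:", np.round(depth, 3))
print("lowest-depth index:", borderline)
```

Any affine-invariant depth function (halfspace depth, projection depth) could replace the Mahalanobis depth in step 2; the centre-outward ordering it induces is what drives the outlier flagging.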
Related papers
- Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates.
We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
- Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification [4.498100922387482]
Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient.
Previous results demonstrated that these methods can even improve performance on some classification tasks.
This paper investigates how these techniques influence the classification performance and computation costs compared to full fine-tuning.
arXiv Detail & Related papers (2023-08-14T17:12:43Z)
- Constructing Word-Context-Coupled Space Aligned with Associative Knowledge Relations for Interpretable Language Modeling [0.0]
The black-box structure of the deep neural network in pre-trained language models seriously limits the interpretability of the language modeling process.
A Word-Context-Coupled Space (W2CSpace) is proposed by introducing the alignment processing between uninterpretable neural representation and interpretable statistical logic.
Our language model can achieve better performance and more credible interpretability than related state-of-the-art methods.
arXiv Detail & Related papers (2023-05-19T09:26:02Z)
- Learning Mutual Fund Categorization using Natural Language Processing [0.5249805590164901]
We learn the categorization system directly from the unstructured data as depicted in the forms using natural language processing (NLP).
We show that the categorization system can indeed be learned with high accuracy.
arXiv Detail & Related papers (2022-07-11T15:40:18Z)
- Capturing Structural Locality in Non-parametric Language Models [85.94669097485992]
We propose a simple yet effective approach for adding locality information into non-parametric language models.
Experiments on two different domains, Java source code and Wikipedia text, demonstrate that locality features improve model efficacy.
arXiv Detail & Related papers (2021-10-06T15:53:38Z)
- Discrete representations in neural models of spoken language [56.29049879393466]
We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language.
We find that the different evaluation metrics can give inconsistent results.
arXiv Detail & Related papers (2021-05-12T11:02:02Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Leveraging Class Hierarchies with Metric-Guided Prototype Learning [5.070542698701158]
In many classification tasks, the set of target classes can be organized into a hierarchy.
This structure induces a semantic distance between classes, and can be summarised under the form of a cost matrix.
We propose to model the hierarchical class structure by integrating this metric in the supervision of a prototypical network.
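The cost matrix summarizing semantic distances between classes can be derived from a class hierarchy, for instance as tree distance between leaf classes. The hierarchy and the distance below are illustrative assumptions, not the cited paper's construction.

```python
# Sketch: build a cost matrix from a toy class hierarchy, where
# cost(a, b) is the tree distance between leaf classes a and b.
# The hierarchy here is illustrative, not taken from the paper.
import numpy as np

parent = {"cat": "mammal", "dog": "mammal", "eagle": "bird",
          "mammal": "animal", "bird": "animal", "animal": None}
leaves = ["cat", "dog", "eagle"]

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def tree_dist(a, b):
    """Steps from each leaf up to their lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    common = set(pa) & set(pb)
    return min(pa.index(c) + pb.index(c) for c in common)

C = np.array([[tree_dist(a, b) for b in leaves] for a in leaves])
print(C)  # siblings (cat, dog) are cheaper to confuse than cat vs eagle
```

A matrix like `C` can then weight the loss so that mistakes between nearby classes in the hierarchy are penalized less than mistakes between distant ones.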
arXiv Detail & Related papers (2020-07-06T20:22:08Z)
- The Paradigm Discovery Problem [121.79963594279893]
We formalize the paradigm discovery problem and develop metrics for judging systems.
We report empirical results on five diverse languages.
Our code and data are available for public use.
arXiv Detail & Related papers (2020-05-04T16:38:54Z)
- A Call for More Rigor in Unsupervised Cross-lingual Learning [76.6545568416577]
An existing rationale for such research is based on the lack of parallel data for many of the world's languages.
We argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice.
arXiv Detail & Related papers (2020-04-30T17:06:23Z)
- Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models.
The weights are determined in a data-driven way and depend on the similarity between the corresponding functions.
We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
arXiv Detail & Related papers (2019-12-30T09:50:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.