Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content
- URL: http://arxiv.org/abs/2504.10679v1
- Date: Mon, 14 Apr 2025 20:01:34 GMT
- Title: Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content
- Authors: F. A. Rizvi, T. Navojith, A. M. N. H. Adhikari, W. P. U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana
- Abstract summary: This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text when it mixes low-resource languages such as Sinhala and English, and they fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which results in an accuracy of 87.4%. To ensure data quality, irrelevant comments were filtered using several models; the BERT-base-uncased model achieved 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, outperforming GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with BERT-base-uncased achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
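As a rough sketch of how two of the named extractors can be fused, the Python snippet below combines KeyBERT keyphrase scores with inverted YAKE scores through a simple weighted average; the sentence-embedding model, the fusion weights, and the example sentence are illustrative assumptions rather than the paper's actual configuration (which additionally uses a fine-tuned SpaCy NER model, FinBERT-based embeddings, and EmbedRank).

```python
# Minimal sketch of a hybrid keyword extractor: KeyBERT (embedding similarity)
# fused with YAKE (statistical features) via weighted score fusion.
# The embedding model, weights, and example text are illustrative assumptions,
# not the paper's exact configuration.
from keybert import KeyBERT
import yake

def hybrid_keywords(text, top_n=5, w_embed=0.5, w_stat=0.5):
    # KeyBERT: higher score = more similar to the document embedding.
    kb = KeyBERT(model="all-MiniLM-L6-v2")  # stand-in for FinBERT-based embeddings
    kb_scores = dict(kb.extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=20))

    # YAKE: lower score = more relevant, so invert it onto a comparable scale.
    yk = yake.KeywordExtractor(lan="en", n=2, top=20)
    yk_scores = {kw.lower(): 1.0 / (1.0 + s) for kw, s in yk.extract_keywords(text)}

    # Weighted fusion over the union of candidate keyphrases.
    candidates = set(kb_scores) | set(yk_scores)
    fused = {kw: w_embed * kb_scores.get(kw, 0.0) + w_stat * yk_scores.get(kw, 0.0)
             for kw in candidates}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_n]

print(hybrid_keywords("The bank's new fixed deposit rates apply to savings accounts opened online."))
```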
Related papers
- VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation [13.047103277038175]
Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline.
arXiv Detail & Related papers (2025-05-30T11:18:10Z)
- Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission [65.17811759381978]
A hybrid language model (HLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). We propose a communication-efficient and uncertainty-aware HLM (CU-HLM). We show that CU-HLM achieves up to 206$\times$ higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
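As a toy illustration of the uncertainty-aware skipping idea (not the CU-HLM algorithm itself), the snippet below gates transmission to the remote LLM on the normalised entropy of the draft model's token distribution; the threshold, the uncertainty measure, and the model stubs are assumptions for illustration.

```python
# Toy illustration of uncertainty-gated transmission in a hybrid LM setup:
# the draft token is sent to the remote LLM only when the local (small) model
# is uncertain. Threshold and uncertainty measure are illustrative assumptions.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p + 1e-12))

def should_transmit(draft_probs, max_entropy_ratio=0.3):
    # Normalise entropy by its maximum (uniform distribution) so the threshold
    # does not depend on vocabulary size.
    h_max = np.log(len(draft_probs))
    return (entropy(draft_probs) / h_max) > max_entropy_ratio

def hybrid_decode_step(draft_probs, remote_validate):
    draft_token = int(np.argmax(draft_probs))
    if should_transmit(draft_probs):
        return remote_validate(draft_token)   # uncertain: let the large model verify/correct
    return draft_token                        # confident: skip the uplink transmission

# Example: a peaked distribution is accepted locally, a flat one is transmitted.
remote = lambda tok: tok  # stub standing in for the remote LLM's verification
print(hybrid_decode_step([0.9, 0.05, 0.05], remote))   # transmission skipped
print(hybrid_decode_step([0.4, 0.35, 0.25], remote))   # transmitted for validation
```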
arXiv Detail & Related papers (2025-05-17T02:10:34Z)
- Enhancing Multilingual Sentiment Analysis with Explainability for Sinhala, English, and Code-Mixed Content [0.0]
Existing models struggle with low-resource languages like Sinhala and lack interpretability for practical use.
This research develops a hybrid aspect-based sentiment analysis framework that enhances multilingual capabilities with explainable outputs.
arXiv Detail & Related papers (2025-04-18T08:21:12Z)
- HYBRINFOX at CheckThat! 2024 -- Task 2: Enriching BERT Models with the Expert System VAGO for Subjectivity Detection [0.8083061106940517]
The HYBRINFOX method ranked 1st with a macro F1 score of 0.7442 on the evaluation data.
We explain the principles of our hybrid approach, and outline ways in which the method could be improved for other languages besides English.
arXiv Detail & Related papers (2024-07-04T09:29:19Z)
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
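As a toy example of the kind of curation step that can be tested in such a benchmark, the snippet below applies exact-hash deduplication and a naive length filter; the thresholds and heuristics are assumptions, far simpler than the fuzzy deduplication and learned quality filters used in practice.

```python
# Toy data-curation pass: exact-hash deduplication plus a naive length filter.
# The heuristics and thresholds are illustrative; production pipelines use
# fuzzy dedup (e.g. MinHash) and learned quality classifiers.
import hashlib

def curate(docs, min_words=50, max_words=20000):
    seen = set()
    kept = []
    for doc in docs:
        n_words = len(doc.split())
        if not (min_words <= n_words <= max_words):
            continue  # drop documents that are too short or too long
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = ["lorem ipsum " * 30, "lorem ipsum " * 30, "too short"]
print(len(curate(corpus)))  # 1: one duplicate and one too-short document removed
```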
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
- From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements for the acceptability of code-mixed text can help in distinguishing natural code-mixed text.
The resulting dataset, Cline, is the largest of its kind, with 16,642 sentences drawn from two sources.
Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs).
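For background, one widely used code-mixing metric of the kind such MLP baselines consume is the Code-Mixing Index (CMI) of Das and Gambäck (2014); the implementation below follows that standard formulation and is not necessarily the paper's exact feature set.

```python
# Code-Mixing Index (CMI) following Das and Gambäck (2014):
# CMI = 100 * (1 - max_lang_tokens / (total_tokens - lang_independent_tokens)),
# and 0 when every token is language independent. Input is a per-token language tag list.
from collections import Counter

def cmi(lang_tags, independent_labels=("univ", "ne", "other")):
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t in independent_labels)
    if n == 0 or n == u:
        return 0.0
    counts = Counter(t for t in lang_tags if t not in independent_labels)
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

# Example: a mostly-Sinhala sentence with English insertions.
print(round(cmi(["si", "en", "si", "en", "si", "si"]), 1))  # 33.3
```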
arXiv Detail & Related papers (2024-05-09T06:40:39Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model.
We find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns.
We demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens in the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model.
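The masking trick is easy to reproduce; the sketch below masks 20% of the non-error positions in a token sequence, with the mask token id and data format assumed for illustration.

```python
# Sketch of the fine-tuning trick described above: randomly mask 20% of the
# NON-error tokens so the model keeps acting as a language model instead of
# over-fitting to the error distribution. Token ids and the mask id are assumed.
import random

def mask_non_error_tokens(input_ids, error_positions, mask_id, rate=0.2, seed=None):
    rng = random.Random(seed)
    masked = list(input_ids)
    error_set = set(error_positions)
    candidates = [i for i in range(len(input_ids)) if i not in error_set]
    n_mask = int(len(candidates) * rate)
    for i in rng.sample(candidates, n_mask):
        masked[i] = mask_id  # error positions are never masked
    return masked

# 10 tokens, positions 2 and 7 hold spelling errors; 103 stands in for [MASK].
print(mask_non_error_tokens(list(range(1000, 1010)), error_positions=[2, 7], mask_id=103, seed=0))
```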
arXiv Detail & Related papers (2023-05-28T13:19:12Z)
- Comparative Study of Pre-Trained BERT Models for Code-Mixed Hindi-English Data [0.7874708385247353]
"Code Mixed" refers to the use of more than one language in the same text.
In this work, we focus on low-resource Hindi-English code-mixed language.
We report state-of-the-art results on respective datasets using HingBERT-based models.
arXiv Detail & Related papers (2023-05-25T05:10:28Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Using BERT Encoding to Tackle the Mad-lib Attack in SMS Spam Detection [0.0]
We investigate whether language models sensitive to the semantics and context of words, such as Google's BERT, may be useful to overcome this adversarial attack.
Using a dataset of 5572 SMS spam messages, we first established a baseline of detection performance.
Then, we built a thesaurus of the vocabulary contained in these messages, and set up a Mad-lib attack experiment.
We found that the classic models achieved a 94% Balanced Accuracy (BA) in the original dataset, whereas the BERT model obtained 96%.
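The Mad-lib attack itself is straightforward to reproduce: replace words with thesaurus synonyms so that keyword- or bag-of-words-based filters miss them; the toy thesaurus and substitution rate below are illustrative assumptions.

```python
# Toy Mad-lib attack: swap words for thesaurus synonyms so that keyword or
# bag-of-words spam filters miss them, while the meaning (and hopefully a
# contextual model's verdict) is preserved. Thesaurus and rate are illustrative.
import random

THESAURUS = {"free": "complimentary", "win": "obtain", "prize": "reward", "cash": "money"}

def madlib_attack(message, thesaurus=THESAURUS, rate=1.0, seed=0):
    rng = random.Random(seed)
    out = []
    for word in message.split():
        key = word.lower()
        if key in thesaurus and rng.random() < rate:
            out.append(thesaurus[key])  # substitute with a synonym
        else:
            out.append(word)
    return " ".join(out)

print(madlib_attack("win a free cash prize today"))
# -> "obtain a complimentary money reward today"
```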
arXiv Detail & Related papers (2021-07-13T21:17:57Z)
- Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, considering the bidirectional and conditionally independent nature of BERT.
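A typical lightweight adapter in this line of work is a bottleneck layer with a residual connection (in the style of Houlsby et al., 2019); the PyTorch sketch below shows that generic design, with hidden and bottleneck sizes assumed for illustration rather than taken from the paper.

```python
# Generic bottleneck adapter (down-project, non-linearity, up-project, residual),
# the kind of lightweight module inserted into frozen BERT layers so that only
# a few parameters are fine-tuned. Sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen backbone's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
x = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(x).shape)       # torch.Size([2, 16, 768])
```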
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
arXiv Detail & Related papers (2020-06-05T19:54:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.