A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts
- URL: http://arxiv.org/abs/2407.15136v1
- Date: Sun, 21 Jul 2024 12:14:45 GMT
- Title: A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts
- Authors: Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Özen Nergis Dolcerocca
- Abstract summary: This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
This is the first study to apply large language models (LLMs) to this dataset, which is sourced from prominent literary periodicals of the era.
- Score: 8.405938712823563
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian, sourced from prominent literary periodicals of the era, and this is the first study to apply large language models (LLMs) to it. The texts have been meticulously organized and labeled according to a taxonomic framework that accounts for both their structural and semantic attributes, and articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases BoW outperforms the LLMs, emphasizing the need for additional research, especially in low-resource language settings. The dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, particularly those working on historical and low-resource languages, and is publicly available.
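The abstract describes a classical bag-of-words (BoW) naive Bayes baseline evaluated alongside the LLMs. Below is a minimal sketch of such a multi-label BoW baseline using scikit-learn; the example documents, label names, and preprocessing choices are illustrative assumptions, not the dataset's actual schema or the authors' exact setup.

```python
# Hedged sketch: bag-of-words + naive Bayes for multi-label classification,
# in the spirit of the paper's BoW baseline. All texts and labels below are
# illustrative placeholders, not items from the released dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

# Toy documents and multi-label annotations (placeholders).
texts = [
    "a critical essay on the novel and its social role",
    "a short lyric poem in the romantic tradition",
    "review of a recently translated play",
]
labels = [["criticism", "prose"], ["poetry"], ["criticism", "drama"]]

# Binarize the label sets into an indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary naive Bayes classifier per label over unigram counts.
model = make_pipeline(
    CountVectorizer(lowercase=True),
    OneVsRestClassifier(MultinomialNB()),
)
model.fit(texts, Y)

# Evaluate on the toy training data only to show the API; real use would hold out a test split.
pred = model.predict(texts)
print("micro-F1:", f1_score(Y, pred, average="micro"))
```

In practice one would also report per-level and per-label scores, since the dataset is organized hierarchically (multi-level) and each document can carry several labels.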
Related papers
- Universal Cross-Lingual Text Classification [0.3958317527488535]
This research proposes a novel perspective on Universal Cross-Lingual Text Classification.
Our approach involves blending supervised data from different languages during training to create a universal model.
The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages.
arXiv Detail & Related papers (2024-06-16T17:58:29Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati [1.666378501554705]
Local/Native South African languages are classified as low-resource languages.
In this work, the focus was on creating annotated news datasets for the isiZulu and Siswati native languages.
arXiv Detail & Related papers (2023-06-12T21:02:12Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models at distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- An Amharic News Text classification Dataset [0.0]
We introduce an Amharic text classification dataset consisting of more than 50k news articles categorized into 6 classes.
The dataset is made available with simple baseline results to encourage further studies and better-performing models.
arXiv Detail & Related papers (2021-03-10T16:36:39Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Deep Learning for Hindi Text Classification: A Comparison [6.8629257716723]
Research on classifying the morphologically rich and low-resource Hindi language, written in the Devanagari script, has been limited by the absence of a large labeled corpus.
In this work, we used translated versions of English datasets to evaluate models based on CNN, LSTM and Attention.
The paper also serves as a tutorial for popular text classification techniques.
arXiv Detail & Related papers (2020-01-19T09:29:12Z)