L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT
models
- URL: http://arxiv.org/abs/2401.00170v1
- Date: Sat, 30 Dec 2023 08:30:24 GMT
- Title: L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT
models
- Authors: Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar,
Raviraj Joshi
- Abstract summary: The L3Cube-MahaSocialNER dataset is the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language.
The dataset comprises 18,000 manually labeled sentences covering eight entity classes.
Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations.
- Score: 1.8624310307965966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work introduces the L3Cube-MahaSocialNER dataset, the first and largest
social media dataset specifically designed for Named Entity Recognition (NER)
in the Marathi language. The dataset comprises 18,000 manually labeled
sentences covering eight entity classes, addressing challenges posed by social
media data, including non-standard language and informal idioms. Deep learning
models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on
the individual dataset with IOB and non-IOB notations. The results demonstrate
the effectiveness of these models in accurately recognizing named entities in
Marathi informal text. The L3Cube-MahaSocialNER dataset offers user-centric
information extraction and supports real-time applications, providing a
valuable resource for public opinion analysis, news, and marketing on social
media platforms. We also show that the zero-shot results of the regular NER
model are poor on the social NER test set thus highlighting the need for more
social NER datasets. The datasets and models are publicly available at
https://github.com/l3cube-pune/MarathiNLP
Related papers
- RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models.
These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z) - NERetrieve: Dataset for Next Generation Named Entity Recognition and
Retrieval [49.827932299460514]
We argue that capabilities provided by large language models are not the end of NER research, but rather an exciting beginning.
We present three variants of the NER task, together with a dataset to support them.
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z) - My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models
and Evaluation Benchmarks [0.7874708385247353]
We focus on the low-resource Indian language Marathi which lacks any prior work in code-mixing.
We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining.
We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus.
arXiv Detail & Related papers (2023-06-24T18:17:38Z) - Constructing Colloquial Dataset for Persian Sentiment Analysis of Social
Microblogs [0.0]
This paper first constructs a user opinion dataset called ITRC-Opinion in a collaborative environment and insource way.
Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram.
Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts.
arXiv Detail & Related papers (2023-06-22T05:51:22Z) - ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT
models [0.7874708385247353]
We focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state.
We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi.
In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc.
arXiv Detail & Related papers (2022-04-12T18:32:15Z) - L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and
BERT models [0.7874708385247353]
In India, Marathi is one of the most popular languages used by a wide audience.
In this work, we present L3Cube-MahaHate, the first major Hate Speech dataset in Marathi.
arXiv Detail & Related papers (2022-03-25T17:00:33Z) - An Open-Source Dataset and A Multi-Task Model for Malay Named Entity
Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens)
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z) - MobIE: A German Dataset for Named Entity Recognition, Entity Linking and
Relation Extraction in the Mobility Domain [76.21775236904185]
dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z) - Named Entity Recognition for Social Media Texts with Semantic
Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.