BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal
Reference Annotations
- URL: http://arxiv.org/abs/2304.03682v3
- Date: Mon, 3 Jul 2023 18:33:23 GMT
- Title: BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal
Reference Annotations
- Authors: Shadman Rohan, Mojammel Hossain, Mohammad Mamun Or Rashid, Nabeel
Mohammed
- Abstract summary: We introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains.
This relatively small dataset contains 5200 mention annotations forming 502 mention clusters within 48,569 tokens.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coreference resolution is a well-studied problem in NLP. While widely studied
for English and other resource-rich languages, research on coreference
resolution in Bengali largely remains unexplored due to the absence of relevant
datasets. Bengali, being a low-resource language, exhibits greater
morphological richness compared to English. In this article, we introduce a new
dataset, BenCoref, comprising coreference annotations for Bengali texts
gathered from four distinct domains. This relatively small dataset contains
5200 mention annotations forming 502 mention clusters within 48,569 tokens. We
describe the process of creating this dataset and report the performance of
multiple models trained using BenCoref. We expect that our work provides some
valuable insights on the variations in coreference phenomena across several
domains in Bengali and encourages the development of additional resources for
Bengali. Furthermore, we found poor cross-lingual performance in the zero-shot
setting from English, highlighting the need for more language-specific
resources for this task.
Related papers
- BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion [1.2416206871977309]
Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect.
This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News), comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen, a contextual multi-input feature fusion headline generation approach.
arXiv Detail & Related papers (2025-01-02T05:34:21Z)
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Bengali Handwritten Grapheme Classification: Deep Learning Approach [0.0]
We participate in a Kaggle competition where the challenge is to classify three constituent elements of a Bengali grapheme in the image.
We explore the performance of some existing neural network models such as Multi-Layer Perceptron (MLP) and the state-of-the-art ResNet50.
We propose our own convolutional neural network (CNN) model for Bengali grapheme classification with validation root accuracy 95.32%, vowel accuracy 98.61%, and consonant accuracy 98.76%.
arXiv Detail & Related papers (2021-11-16T06:14:59Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Anubhuti -- An annotated dataset for emotional analysis of Bengali short stories [2.3424047967193826]
Anubhuti is the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories.
We explain the data collection methods, the manual annotation process and the resulting high inter-annotator agreement.
We have verified the performance of our dataset with baseline Machine Learning and a Deep Learning model for emotion classification.
arXiv Detail & Related papers (2020-10-06T22:33:58Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
- A Continuous Space Neural Language Model for Bengali Language [0.4799822253865053]
This paper proposes a continuous-space neural language model, or more specifically an ASGD weight-dropped LSTM language model, along with techniques to efficiently train it for the Bengali language.
The proposed architecture outperforms its counterparts by achieving an inference perplexity as low as 51.2 on the held-out dataset for Bengali.
arXiv Detail & Related papers (2020-01-11T14:50:57Z)