Mukhyansh: A Headline Generation Dataset for Indic Languages
- URL: http://arxiv.org/abs/2311.17743v1
- Date: Wed, 29 Nov 2023 15:49:24 GMT
- Title: Mukhyansh: A Headline Generation Dataset for Indic Languages
- Authors: Lokesh Madasu, Gopichand Kanumolu, Nirmal Surange, Manish Shrivastava
- Abstract summary: Mukhyansh is an extensive multilingual dataset, tailored for Indian language headline generation.
Comprising over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages.
Mukhyansh outperforms all other models, achieving an average ROUGE-L score of 31.43 across all 8 languages.
- Score: 4.583536403673757
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The task of headline generation within the realm of Natural Language
Processing (NLP) holds immense significance, as it strives to distill the true
essence of textual content into concise and attention-grabbing summaries. While
noteworthy progress has been made in headline generation for widely spoken
languages like English, there persist numerous challenges when it comes to
generating headlines in low-resource languages, such as the rich and diverse
Indian languages. A prominent obstacle that specifically hinders headline
generation in Indian languages is the scarcity of high-quality annotated data.
To address this crucial gap, we proudly present Mukhyansh, an extensive
multilingual dataset, tailored for Indian language headline generation.
Comprising an impressive collection of over 3.39 million article-headline
pairs, Mukhyansh spans across eight prominent Indian languages, namely Telugu,
Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a
comprehensive evaluation of several state-of-the-art baseline models.
Additionally, through an empirical analysis of existing works, we demonstrate
that Mukhyansh outperforms all other models, achieving an impressive average
ROUGE-L score of 31.43 across all 8 languages.
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Breaking Language Barriers: A Question Answering Dataset for Hindi and
Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z) - Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z) - PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for
Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground for four language families, 14 languages, and the largest to date with 196 language pairs.
arXiv Detail & Related papers (2023-05-15T17:41:15Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum
of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local
Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z) - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z) - "A Passage to India": Pre-trained Word Embeddings for Indian Languages [30.607474624873014]
We use various existing approaches to create multiple word embeddings for 14 Indian languages.
We place these embeddings for all these languages in a single repository.
We release a total of 436 models using 8 different approaches.
arXiv Detail & Related papers (2021-12-27T17:31:04Z) - Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18% points, in terms of F-score, for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z) - HinFlair: pre-trained contextual string embeddings for pos tagging and
text classification in the Hindi language [0.0]
HinFlair is a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus.
Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings for downstream tasks like text classification and pos tagging.
arXiv Detail & Related papers (2021-01-18T09:23:35Z) - Anubhuti -- An annotated dataset for emotional analysis of Bengali short
stories [2.3424047967193826]
Anubhuti is the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories.
We explain the data collection methods, the manual annotation process and the resulting high inter-annotator agreement.
We have verified the performance of our dataset with baseline Machine Learning and a Deep Learning model for emotion classification.
arXiv Detail & Related papers (2020-10-06T22:33:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.