MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models
- URL: http://arxiv.org/abs/2508.17444v1
- Date: Sun, 24 Aug 2025 16:48:58 GMT
- Title: MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models
- Authors: Suramya Jadhav, Abhay Shanbhag, Amogh Thakurdesai, Ridhima Sinare, Ananya Joshi, Raviraj Joshi,
- Abstract summary: Indic languages are complex in natural language processing due to their rich morphological and syntactic variations, diverse scripts, and limited availability of annotated data.<n>In this work, we present the L3Cube-MahaParaphrase dataset, a high-quality paraphrase corpus for Marathi, a low resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP)<n>We also present the results of standard transformer-based BERT models on these datasets.
- Score: 6.841396630034347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Paraphrases are a vital tool to assist language understanding tasks such as question answering, style transfer, semantic parsing, and data augmentation tasks. Indic languages are complex in natural language processing (NLP) due to their rich morphological and syntactic variations, diverse scripts, and limited availability of annotated data. In this work, we present the L3Cube-MahaParaphrase Dataset, a high-quality paraphrase corpus for Marathi, a low resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP). We also present the results of standard transformer-based BERT models on these datasets. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
Related papers
- L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages [2.584263027095689]
L3Cube-IndicHeadline-ID is a curated dataset spanning ten low-resource Indic languages.<n>Each language includes 20,000 news articles paired with four headline variants.<n>The task requires selecting the correct headline from the options using article-headline similarity.<n>We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity.
arXiv Detail & Related papers (2025-09-02T16:54:30Z) - End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data [5.950263765640278]
This paper explores the hypothesis that weakly labeled data can be used to build speech-to-text translation models.<n>We constructed datasets with the help of bitext mining using state-of-the-art sentence encoders.<n>Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines.
arXiv Detail & Related papers (2025-06-19T12:11:01Z) - COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing [1.3062731746155414]
COMI-LINGUA is the largest manually annotated Hindi-English code-mixed dataset.<n>It comprises 125K+ high-quality instances across five core NLP tasks.<n>Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations.
arXiv Detail & Related papers (2025-03-27T16:36:39Z) - One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks [26.848664285007022]
ByT5-Sanskrit is designed for NLP applications involving the morphologically rich language Sanskrit.
It is easier to deploy and more robust to data not covered by external linguistic resources.
We show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages.
arXiv Detail & Related papers (2024-09-20T22:02:26Z) - Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics
Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z) - ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR
Back-Translation [59.91139600152296]
ParaAMR is a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation.
We show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning.
arXiv Detail & Related papers (2023-05-26T02:27:33Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - Training Effective Neural Sentence Encoders from Automatically Mined
Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
arXiv Detail & Related papers (2022-07-26T09:08:56Z) - Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantics.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z) - Experimental Evaluation of Deep Learning models for Marathi Text
Classification [0.0]
We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets.
We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets.
arXiv Detail & Related papers (2021-01-13T06:21:27Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts the Siamese network, learning sentence-level representations from natural language inference dataset and word/phrase-level representations from paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.