ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity
- URL: http://arxiv.org/abs/2404.12010v1
- Date: Thu, 18 Apr 2024 09:02:45 GMT
- Title: ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity
- Authors: Lasal Jayawardena, Prasan Yapa
- Abstract summary: Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences.
This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLMs).
ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLMs) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation, as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.
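As an illustration of how lexical diversity between a source sentence and its paraphrase can be scored (the measure below is a generic n-gram overlap metric, not necessarily one of the metrics the paper uses):

```python
# Illustrative sketch (not the paper's exact metrics): one common way to
# quantify how much a paraphrase diverges lexically from its source is
# 1 - n-gram overlap; higher scores mean more diverse rewording.

def ngrams(tokens, n):
    """Return the set of n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lexical_diversity(source: str, paraphrase: str, n: int = 2) -> float:
    """Jaccard-based lexical diversity: 0 = identical wording, 1 = disjoint."""
    a = ngrams(source.lower().split(), n)
    b = ngrams(paraphrase.lower().split(), n)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

src = "The committee rejected the proposal after a long debate."
para = "Following lengthy discussion, the panel turned the plan down."
print(f"bigram diversity: {lexical_diversity(src, para):.2f}")  # close to 1.0
```

A full evaluation would pair a lexical score like this with syntactic measures (e.g., parse-tree edit distance) and a semantic similarity check, since diversity is only useful if meaning is preserved.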
Related papers
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
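A minimal usage sketch, assuming the released checkpoint is the fastText binary published under cis-lmu/glotlid on the Hugging Face Hub (repo id and filename are assumptions here):

```python
# Sketch of running GlotLID for language identification; assumes the
# released fastText checkpoint lives at cis-lmu/glotlid on the HF Hub.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
model = fasttext.load_model(model_path)

labels, probs = model.predict("Hello, how are you today?")
print(labels[0], float(probs[0]))  # e.g. '__label__eng_Latn' plus a confidence
```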
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) suffers from uneven performance across languages, stemming from the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
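A generic sketch of cross-lingual contrastive alignment (an InfoNCE-style loss over translation pairs; this is not the authors' exact multi-view objective):

```python
# Generic InfoNCE-style contrastive loss for aligning source- and
# target-language sentence embeddings; NOT the exact mCL-NER objective.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_emb, tgt_emb, temperature=0.07):
    """src_emb, tgt_emb: (batch, dim) embeddings of translation pairs.
    Matched rows are positives; other rows in the batch act as negatives."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(src.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())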
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results show that NLI fine-tuning improves model performance on both tasks and in both languages, and can benefit both monolingual and multilingual models.
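A minimal sketch of this kind of contrastive fine-tuning with the sentence-transformers library, assuming NLI entailment pairs as positives and in-batch negatives; the checkpoint and the toy data are placeholders, and the pre-3.0 fit() interface is shown:

```python
# Sketch of contrastive fine-tuning on NLI data with sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

# Entailment pairs serve as positives; other pairs in the batch as negatives.
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "Someone is making music."]),
    InputExample(texts=["Two dogs run through a field.", "Animals are outdoors."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```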
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
- Making Metadata More FAIR Using Large Language Models [2.61630828688114]
This work presents a Natural Language Processing (NLP) informed application, called FAIRMetaText, that compares metadata.
Specifically, FAIRMetaText analyzes the natural language descriptions of metadata and provides a mathematical similarity measure between two terms.
Evaluated on several experimental datasets on the same topic, this software can drastically reduce the human effort required to sift through natural language metadata.
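For illustration only (this is not FAIRMetaText itself), a simple TF-IDF cosine similarity between two metadata descriptions:

```python
# Toy similarity measure between natural-language metadata descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

desc_a = "Age of the study participant at the time of enrollment, in years."
desc_b = "Participant age in years when enrolled in the study."

tfidf = TfidfVectorizer().fit_transform([desc_a, desc_b])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"similarity: {score:.2f}")
```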
arXiv Detail & Related papers (2023-07-24T19:14:38Z)
- ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation [59.91139600152296]
ParaAMR is a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation.
We show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning.
arXiv Detail & Related papers (2023-05-26T02:27:33Z)
- Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns [0.5560631344057825]
We propose several Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks.
Our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask significantly improves classification accuracy.
We test our data augmentation method in a variety of low-resource and cross-lingual settings, achieving a relative improvement of up to 7.73% on the extremely scarce English-Malayalam dataset.
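A toy sketch of the constant-mask idea described above; the mask rate and the mask token are our assumptions, not the paper's exact recipe:

```python
# Replace a random subset of matrix-language tokens with a fixed mask token,
# mimicking positions where embedded-language words would appear.
import random

MASK = "[MASK]"

def constant_mask_augment(sentence: str, mask_rate: float = 0.3,
                          seed: int = 0) -> str:
    rng = random.Random(seed)
    tokens = sentence.split()
    masked = [MASK if rng.random() < mask_rate else tok for tok in tokens]
    return " ".join(masked)

print(constant_mask_augment("this movie was surprisingly good and very funny"))
```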
arXiv Detail & Related papers (2022-11-14T18:50:16Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models.
We show that the already competitive results achieved by SOTA LMs/MLMs can be substantially improved further if information about the target word is injected properly.
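One known injection recipe is the "target and [MASK]" pattern; here is a hedged sketch with an off-the-shelf masked LM (not necessarily the paper's exact method):

```python
# Comparing plain cloze completion against target-word injection for
# lexical substitution, using a masked language model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Without the target word, the model just does generic cloze completion.
plain = fill("The acting was [MASK].", top_k=3)

# Injecting the target ("superb") biases predictions toward its substitutes.
injected = fill("The acting was superb and [MASK].", top_k=3)

print([p["token_str"] for p in plain])
print([p["token_str"] for p in injected])
```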
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
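A sketch of the dataset-cartography statistics behind the selection step, computed from toy per-epoch gold-label probabilities (the thresholds are our own):

```python
# Dataset cartography: per-example confidence (mean probability of the gold
# label across training epochs) and variability (its standard deviation).
# Hard/ambiguous regions have low confidence and/or high variability.
import numpy as np

# gold_probs[e, i] = probability the model assigns to example i's gold label
# after epoch e (toy numbers here).
gold_probs = np.array([
    [0.90, 0.20, 0.50],
    [0.95, 0.30, 0.80],
    [0.97, 0.25, 0.40],
])

confidence = gold_probs.mean(axis=0)   # high -> easy-to-learn
variability = gold_probs.std(axis=0)   # high -> ambiguous
hard_or_ambiguous = (confidence < 0.5) | (variability > 0.15)
print(confidence, variability, hard_or_ambiguous)
```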
arXiv Detail & Related papers (2022-01-16T03:13:49Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
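A toy sketch of one such syntax-level operation, "cropping" a dependency parse down to the root plus a single core-argument subtree; the parse here is hand-written, whereas a real pipeline would take it from a parser:

```python
# Each token: (index, word, head_index, deprel); head 0 marks the root.
sent = [
    (1, "the", 2, "det"), (2, "dog", 3, "nsubj"), (3, "chased", 0, "root"),
    (4, "the", 5, "det"), (5, "cat", 3, "obj"), (6, "yesterday", 3, "advmod"),
]

def subtree(tokens, head):
    """Indices of `head` plus everything transitively attached to it."""
    keep, frontier = {head}, [head]
    while frontier:
        h = frontier.pop()
        for idx, _, hd, _ in tokens:
            if hd == h and idx not in keep:
                keep.add(idx)
                frontier.append(idx)
    return keep

root = next(i for i, _, h, _ in sent if h == 0)
for arg in [i for i, _, h, rel in sent if h == root and rel in ("nsubj", "obj")]:
    keep = subtree(sent, arg) | {root}
    print(" ".join(w for i, w, _, _ in sent if i in keep))
# -> "the dog chased" and "chased the cat"
```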
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Extracting and filtering paraphrases by bridging natural language inference and paraphrasing [0.0]
We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets.
The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.
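A sketch of the underlying idea, filtering candidate pairs by bidirectional entailment with an off-the-shelf NLI model; the checkpoint and threshold are our choices, not the paper's:

```python
# A sentence pair is treated as a paraphrase only if each side entails
# the other under an NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name)
ENT = nli.config.label2id.get("ENTAILMENT", 2)  # entailment class index

def entails(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)
    return probs[0, ENT].item()

def is_paraphrase(a: str, b: str, threshold: float = 0.9) -> bool:
    return entails(a, b) > threshold and entails(b, a) > threshold

print(is_paraphrase("He bought a used car.", "He purchased a second-hand car."))
```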
arXiv Detail & Related papers (2021-11-13T14:06:37Z)