A Large-Scale Benchmark for Vietnamese Sentence Paraphrases
- URL: http://arxiv.org/abs/2502.07188v1
- Date: Tue, 11 Feb 2025 02:30:21 GMT
- Title: A Large-Scale Benchmark for Vietnamese Sentence Paraphrases
- Authors: Sang Quang Nguyen, Kiet Van Nguyen
- Abstract summary: This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs.
To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing.
- Score: 1.1842520528140819
- Abstract: This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
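Among the baselines named in the abstract, EDA (easy data augmentation) is the simplest to illustrate. The sketch below shows two of its four standard operations, random swap and random deletion, applied to whitespace-separated tokens. It is not the authors' implementation: the function names, default parameters, and the whitespace tokenization are assumptions made for illustration, and whitespace splitting separates the syllables of multi-syllable Vietnamese words, so a real pipeline would likely apply a word segmenter first.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def eda_candidates(sentence, n_swaps=1, p_delete=0.1, seed=0):
    """Return two perturbed variants of a sentence (hypothetical helper)."""
    rng = random.Random(seed)
    tokens = sentence.split()  # assumption: naive whitespace tokenization
    return [
        " ".join(random_swap(tokens, n_swaps, rng)),
        " ".join(random_deletion(tokens, p_delete, rng)),
    ]


if __name__ == "__main__":
    for variant in eda_candidates("tôi thích học tiếng Việt"):
        print(variant)
```

Perturbations of this kind change surface form without checking that meaning is preserved, so their outputs are better treated as weak paraphrase candidates than as verified paraphrases.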
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese [0.0]
Vintern-1B is a reliable multimodal large language model (MLLM) for Vietnamese language tasks.
The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs.
Vintern-1B is small enough to fit into various on-device applications easily.
arXiv Detail & Related papers (2024-08-22T15:15:51Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- LyricSIM: A novel Dataset and Benchmark for Similarity Detection in Spanish Song LyricS [52.77024349608834]
We present a new dataset and benchmark tailored to the task of semantic similarity in song lyrics.
Our dataset, originally consisting of 2775 pairs of Spanish songs, was annotated in a collective annotation experiment by 63 native annotators.
arXiv Detail & Related papers (2023-06-02T07:48:20Z)
- MTet: Multi-domain Translation for English and Vietnamese [10.126442202316825]
MTet is the largest publicly available parallel corpus for English-Vietnamese translation.
We release the first pretrained model EnViT5 for English and Vietnamese languages.
arXiv Detail & Related papers (2022-10-11T16:55:21Z)
- EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation [63.88541605363555]
"Extract and Generate" (EAG) is a two-step approach for constructing a large-scale, high-quality multi-way aligned corpus from bilingual data.
We first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences.
We then generate the final aligned examples from the candidates with a well-trained generation model.
arXiv Detail & Related papers (2022-03-04T08:21:27Z)
- PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation [6.950742601378329]
We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs.
This is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15.
In both automatic and human evaluations, the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART.
arXiv Detail & Related papers (2021-10-23T11:42:01Z)
- Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)
- A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese [11.782566169354725]
We present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese.
We find that automatic Vietnamese word segmentation improves the parsing results of both baselines.
PhoBERT for Vietnamese yields higher performance than the recent best multilingual language model XLM-R.
arXiv Detail & Related papers (2020-10-05T09:54:51Z)
- A Vietnamese Dataset for Evaluating Machine Reading Comprehension [2.7528170226206443]
We present UIT-ViQuAD, a new dataset for evaluating machine reading comprehension models on Vietnamese, a low-resource language.
This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
We conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD.
arXiv Detail & Related papers (2020-09-30T15:06:56Z)