Improving Vietnamese Legal Document Retrieval using Synthetic Data
- URL: http://arxiv.org/abs/2412.00657v1
- Date: Sun, 01 Dec 2024 03:28:26 GMT
- Title: Improving Vietnamese Legal Document Retrieval using Synthetic Data
- Authors: Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet,
- Abstract summary: The scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts.<n>We propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically bi-encoder and ColBERT, which are further fine-tuned using contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvement in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations posed by the lack of large labeled datasets in the Vietnamese legal domain.
Related papers
- Less is More: Adaptive Coverage for Synthetic Training Data [20.136698279893857]
This study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset.
Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset.
arXiv Detail & Related papers (2025-04-20T06:45:16Z) - Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses only less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - Vietnamese Legal Information Retrieval in Question-Answering System [0.0]
Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capabilities of large language models (LLMs)
However, RAG often fall short when applied to the Vietnamese language due to several challenges.
This report introduces our three main modifications taken to address these challenges.
arXiv Detail & Related papers (2024-09-05T02:34:05Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - Multi-dimensional data refining strategy for effective fine-tuning LLMs [2.67766280323297]
This paper presents lessons learned while crawling and refining data tailored for fine-tuning Vietnamese language models.
Our paper presents a multidimensional strategy including leveraging existing datasets in the English language and developing customized data-crawling scripts with the assistance of generative AI tools.
arXiv Detail & Related papers (2023-11-02T07:50:43Z) - Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z) - Improving Vietnamese Legal Question--Answering System based on Automatic
Data Enrichment [2.56085064991751]
In this paper, we try to overcome these limitations by implementing a Vietnamese article-level retrieval-based legal QA system.
Our hypothesis is that in contexts where labeled data are limited, efficient data enrichment can help increase overall performance.
arXiv Detail & Related papers (2023-06-08T00:24:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.