VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering
- URL: http://arxiv.org/abs/2507.19995v1
- Date: Sat, 26 Jul 2025 16:26:50 GMT
- Title: VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering
- Authors: Tan-Minh Nguyen, Hoang-Trung Nguyen, Trong-Khoi Dao, Xuan-Hieu Phan, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong
- Abstract summary: We introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness.
- Score: 4.546567493379192
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal corpora for supervised training, validation, and supervised fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.
Related papers
- Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament [0.0]
The research contributes to the advancement of NLP in the legal field, particularly in the Polish language. It has been demonstrated that even commonly accessible data can be practically utilized for legislative content analysis.
arXiv Detail & Related papers (2025-03-15T12:10:20Z) - LegalAgentBench: Evaluating LLM Agents in Legal Domain [53.70993264644004]
LegalAgentBench is a benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge.
arXiv Detail & Related papers (2024-12-23T04:02:46Z) - Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges [4.548047308860141]
This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 154 studies, with a final selection of 133 after manual filtering. It explores foundational concepts related to NLP in the legal domain, illustrating the unique aspects and challenges of processing legal texts. We provide an overview of NLP tasks specific to legal text, such as Legal Document Summarisation, Legal Named Entity Recognition, Legal Question Answering, Legal Argument Mining, Legal Text Classification, and Legal Judgement Prediction.
arXiv Detail & Related papers (2024-10-25T01:17:02Z) - InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z) - Exploring the Nexus of Large Language Models and Legal Systems: A Short Survey [1.0770079992809338]
The capabilities of Large Language Models (LLMs) are increasingly demonstrating unique roles in the legal sector.
This survey delves into the synergy between LLMs and the legal system, such as their applications in tasks like legal text comprehension, case retrieval, and analysis.
The survey showcases the latest advancements in fine-tuned legal LLMs tailored for various legal systems, along with legal datasets available for fine-tuning LLMs in various languages.
arXiv Detail & Related papers (2024-04-01T08:35:56Z) - Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are the previous legal cases with similar facts, which are the basis for the judgment of the subsequent case in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z) - How to Handle Different Types of Out-of-Distribution Scenarios in Computational Argumentation? A Comprehensive and Fine-Grained Field Study [59.13867562744973]
This work systematically assesses LMs' capabilities for out-of-distribution (OOD) scenarios.
We find that the efficacy of such learning paradigms varies with the type of OOD.
Specifically, in-context learning (ICL) excels under domain shifts, while prompt-based fine-tuning performs better under topic shifts.
arXiv Detail & Related papers (2023-09-15T11:15:47Z) - NeCo@ALQAC 2023: Legal Domain Knowledge Acquisition for Low-Resource Languages through Data Enrichment [2.441072488254427]
This paper presents NeCo Team's solutions to the Vietnamese text processing tasks provided in the Automated Legal Question Answering Competition 2023 (ALQAC 2023).
Our methods for the legal document retrieval task employ a combination of similarity ranking and deep learning models, while for the second task, we propose a range of adaptive techniques to handle different question types.
Our approaches achieve outstanding results on both tasks of the competition, demonstrating the potential benefits and effectiveness of question answering systems in the legal field.
arXiv Detail & Related papers (2023-09-11T14:43:45Z) - Improving Vietnamese Legal Question-Answering System based on Automatic Data Enrichment [2.56085064991751]
In this paper, we try to overcome these limitations by implementing a Vietnamese article-level retrieval-based legal QA system.
Our hypothesis is that in contexts where labeled data are limited, efficient data enrichment can help increase overall performance.
arXiv Detail & Related papers (2023-06-08T00:24:29Z) - Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release a Longformer-based pre-trained language model, named Lawformer, for understanding long Chinese legal documents.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z) - FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmark the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.