Improving Vietnamese Legal Question--Answering System based on Automatic
Data Enrichment
- URL: http://arxiv.org/abs/2306.04841v1
- Date: Thu, 8 Jun 2023 00:24:29 GMT
- Title: Improving Vietnamese Legal Question--Answering System based on Automatic
Data Enrichment
- Authors: Thi-Hai-Yen Vuong, Ha-Thanh Nguyen, Quang-Huy Nguyen, Le-Minh Nguyen,
and Xuan-Hieu Phan
- Abstract summary: In this paper, we try to overcome these limitations by implementing a Vietnamese article-level retrieval-based legal QA system.
Our hypothesis is that in contexts where labeled data are limited, efficient data enrichment can help increase overall performance.
- Score: 2.56085064991751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question answering (QA) in law is a challenging problem because legal
documents are much more complicated than normal texts in terms of terminology,
structure, and temporal and logical relationships. It is even more difficult to
perform legal QA for low-resource languages like Vietnamese where labeled data
are rare and pre-trained language models are still limited. In this paper, we
try to overcome these limitations by implementing a Vietnamese article-level
retrieval-based legal QA system and introducing a novel method to improve the
performance of language models by improving data quality through weak labeling.
Our hypothesis is that in contexts where labeled data are limited, efficient
data enrichment can help increase overall performance. Our experiments test
multiple aspects of the approach and demonstrate the effectiveness of the
proposed technique.
Related papers
- CompAct: Compressing Retrieved Documents Actively for Question Answering [15.585833125854418]
CompAct is a novel framework that employs an active strategy to condense extensive documents without losing key information.
Our experiments demonstrate that CompAct brings significant improvements in both performance and compression rate on multi-hop question-answering benchmarks.
arXiv Detail & Related papers (2024-07-12T06:06:54Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [85.51252685938564]
Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML).
As with other ML models, large language models (LLMs) are prone to make incorrect predictions, "hallucinate" by fabricating claims, or simply generate low-quality output for a given input.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- Empowering Prior to Court Legal Analysis: A Transparent and Accessible Dataset for Defensive Statement Classification and Interpretation [5.646219481667151]
This paper introduces a novel dataset tailored for classification of statements made during police interviews, prior to court proceedings.
We introduce a fine-tuned DistilBERT model that achieves state-of-the-art performance in distinguishing truthful from deceptive statements.
We also present an XAI interface that empowers both legal professionals and non-specialists to interact with and benefit from our system.
arXiv Detail & Related papers (2024-05-17T11:22:27Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment approach to bridge the gap between large language models' English and non-English performance.
Experiment results show that the question alignment approach can be used to boost multilingual performance across diverse reasoning scenarios.
To understand the mechanism of its success, we analyze representation space, chain-of-thought and translation data scales.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification [60.10193972862099]
This work proposes a framework to characterize and recover simplification-induced information loss in the form of question-and-answer pairs.
QA pairs are designed to help readers deepen their knowledge of a text.
arXiv Detail & Related papers (2024-01-29T19:00:01Z)
- Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models [10.834755282333589]
The Long-form Legal Question Answering (LLeQA) dataset comprises 1,868 expert-annotated legal questions in the French language.
Our experimental results demonstrate promising performance on automatic evaluation metrics.
As one of the only comprehensive, expert-annotated long-form LQA datasets, LLeQA has the potential not only to accelerate research towards resolving a significant real-world issue, but also to act as a rigorous benchmark for evaluating NLP models in specialized domains.
arXiv Detail & Related papers (2023-09-29T08:23:19Z)
- Attentive Deep Neural Networks for Legal Document Retrieval [2.4350217735794337]
We study the use of attentive neural network-based text representation for statute law document retrieval.
We develop two hierarchical architectures with sparse attention to represent long sentences and articles, and we name them Attentive CNN and Paraformer.
Experimental results show that attentive neural methods substantially outperform non-neural methods in terms of retrieval performance across datasets and languages.
arXiv Detail & Related papers (2022-12-13T01:37:27Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- Efficient Entity Candidate Generation for Low-Resource Languages [13.789451365205665]
Candidate generation is a crucial module in entity linking.
It plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases.
This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking.
arXiv Detail & Related papers (2022-06-30T09:49:53Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Conditioned Text Generation with Transfer for Closed-Domain Dialogue Systems [65.48663492703557]
We show how to optimally train and control the generation of intent-specific sentences using a conditional variational autoencoder.
We introduce a new protocol called query transfer that makes it possible to leverage a large unlabelled dataset.
arXiv Detail & Related papers (2020-11-03T14:06:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.