Improving Vietnamese Legal Question--Answering System based on Automatic
Data Enrichment
- URL: http://arxiv.org/abs/2306.04841v1
- Date: Thu, 8 Jun 2023 00:24:29 GMT
- Title: Improving Vietnamese Legal Question--Answering System based on Automatic
Data Enrichment
- Authors: Thi-Hai-Yen Vuong, Ha-Thanh Nguyen, Quang-Huy Nguyen, Le-Minh Nguyen,
and Xuan-Hieu Phan
- Abstract summary: In this paper, we try to overcome these limitations by implementing a Vietnamese article-level retrieval-based legal QA system.
Our hypothesis is that in contexts where labeled data are limited, efficient data enrichment can help increase overall performance.
- Score: 2.56085064991751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question answering (QA) in law is a challenging problem because legal
documents are much more complicated than normal texts in terms of terminology,
structure, and temporal and logical relationships. It is even more difficult to
perform legal QA for low-resource languages like Vietnamese where labeled data
are rare and pre-trained language models are still limited. In this paper, we
try to overcome these limitations by implementing a Vietnamese article-level
retrieval-based legal QA system and introduce a novel method to improve the
performance of language models by improving data quality through weak labeling.
Our hypothesis is that in contexts where labeled data are limited, efficient
data enrichment can help increase overall performance. Our experiments are
designed to test multiple aspects, which demonstrate the effectiveness of the
proposed technique.
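The abstract describes improving a retrieval-based QA system by enriching limited labeled data through weak labeling. As a minimal illustration of that general idea (not the paper's actual pipeline; the tokenization, threshold, and example data below are all hypothetical), one can score unlabeled (question, article) pairs by lexical overlap and treat high-scoring pairs as weak positive training examples for a retriever:

```python
# Hypothetical sketch of weak labeling for retrieval-based legal QA: score each
# (question, article) pair by lexical overlap and keep high-scoring pairs as
# weak positive training examples. Thresholds, tokenization, and data are
# illustrative, not the paper's actual method.
import re

def tokenize(text):
    """Lowercase word tokenization; a production system would need proper
    Vietnamese word segmentation."""
    return set(re.findall(r"\w+", text.lower()))

def overlap_score(question, article):
    """Fraction of question terms that also appear in the article."""
    q, a = tokenize(question), tokenize(article)
    return len(q & a) / len(q) if q else 0.0

def weak_label(pairs, threshold=0.5):
    """Turn unlabeled (question, article) pairs into weakly labeled triples."""
    return [(q, a, int(overlap_score(q, a) >= threshold)) for q, a in pairs]

pairs = [
    ("What is the penalty for tax evasion?",
     "The penalty for tax evasion is a fine or imprisonment."),
    ("What is the penalty for tax evasion?",
     "Marriage registration requires both parties to be present."),
]
labels = weak_label(pairs)  # weak positives can then fine-tune a retriever
```

The resulting weakly labeled triples can supplement a small gold-labeled set when training or fine-tuning an article-level retriever.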
Related papers
- VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering [4.546567493379192]
We introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain.
We also conduct a thorough statistical analysis of the dataset and evaluate its effectiveness.
arXiv Detail & Related papers (2025-07-26T16:26:50Z)
- ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses.
They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications.
We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z)
- QA-prompting: Improving Summarization with Large Language Models using Question-Answering [0.0]
Language Models (LMs) have revolutionized natural language processing, enabling high-quality text generation through prompting and in-context learning.
We propose QA-prompting - a simple prompting method for summarization that utilizes question-answering as an intermediate step prior to summary generation.
Our method extracts key information and enriches the context of text to mitigate positional biases and improve summarization in a single LM call per task without requiring fine-tuning or pipelining.
arXiv Detail & Related papers (2025-05-20T13:29:36Z)
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality.
We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
- Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations [3.735112400244042]
Visual Question Answering (VQA) is a multimodal task requiring reasoning across textual and visual inputs.
Traditional methods often rely heavily on extensive annotated datasets, computationally expensive pipelines, and large pre-trained models.
We propose a training framework that combines a paraphrase-based feature augmentation module with a dynamic curriculum learning strategy.
arXiv Detail & Related papers (2025-03-05T09:12:16Z)
- Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization [1.204553980682492]
We introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models.
We show that our approaches surpass the performance of fine-tuning with extensive multilingual datasets.
Taken together, our methods offer a promising approach to significantly reducing the resources required for formalization, thereby accelerating AI for math.
arXiv Detail & Related papers (2025-02-18T19:16:54Z)
- Improving Vietnamese Legal Document Retrieval using Synthetic Data [0.0]
The scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts.
We propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages.
arXiv Detail & Related papers (2024-12-01T03:28:26Z)
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
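The core mechanism summarized above can be sketched in toy form: score candidate prompts or question phrasings under a language model and select the one assigned the highest likelihood. A unigram model trained on a tiny corpus stands in for a real LLM here; all names and data are illustrative, not the paper's implementation.

```python
# Toy illustration of "likelihood as a performance gauge": rank candidates by
# the (log-)likelihood a language model assigns to them and keep the best one.
# The unigram model is a stand-in for a real LLM; everything here is hypothetical.
import math
from collections import Counter

def train_unigram(tokens):
    """Estimate unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_likelihood(model, tokens, floor=1e-8):
    """Sum of log-probabilities; unseen tokens get a small floor probability."""
    return sum(math.log(model.get(t, floor)) for t in tokens)

corpus = "the court ruled the penalty applies to the defendant".split()
model = train_unigram(corpus)

# Rank candidate question phrasings by model likelihood and pick the best.
candidates = [
    "the court ruled".split(),
    "zebra quantum banana".split(),
]
best = max(candidates, key=lambda c: log_likelihood(model, c))
```

In a real RAG setting, the same selection step would use the actual LLM's token log-probabilities rather than a unigram model.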
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Vietnamese Legal Information Retrieval in Question-Answering System [0.0]
Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capabilities of large language models (LLMs).
However, RAG often falls short when applied to the Vietnamese language due to several challenges.
This report introduces the three main modifications we made to address these challenges.
arXiv Detail & Related papers (2024-09-05T02:34:05Z)
- Empowering Prior to Court Legal Analysis: A Transparent and Accessible Dataset for Defensive Statement Classification and Interpretation [5.646219481667151]
This paper introduces a novel dataset tailored for classification of statements made during police interviews, prior to court proceedings.
We introduce a fine-tuned DistilBERT model that achieves state-of-the-art performance in distinguishing truthful from deceptive statements.
We also present an XAI interface that empowers both legal professionals and non-specialists to interact with and benefit from our system.
arXiv Detail & Related papers (2024-05-17T11:22:27Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification [60.10193972862099]
This work proposes a framework to characterize and recover simplification-induced information loss in form of question-and-answer pairs.
QA pairs are designed to help readers deepen their knowledge of a text.
arXiv Detail & Related papers (2024-01-29T19:00:01Z)
- Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models [10.834755282333589]
Long-form Legal Question Answering dataset comprises 1,868 expert-annotated legal questions in the French language.
Our experimental results demonstrate promising performance on automatic evaluation metrics.
As one of the only comprehensive, expert-annotated long-form LQA datasets, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains.
arXiv Detail & Related papers (2023-09-29T08:23:19Z)
- Attentive Deep Neural Networks for Legal Document Retrieval [2.4350217735794337]
We study the use of attentive neural network-based text representation for statute law document retrieval.
We develop two hierarchical architectures with sparse attention to represent long sentences and articles, and we name them Attentive CNN and Paraformer.
Experimental results show that Attentive neural methods substantially outperform non-neural methods in terms of retrieval performance across datasets and languages.
arXiv Detail & Related papers (2022-12-13T01:37:27Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- Efficient Entity Candidate Generation for Low-Resource Languages [13.789451365205665]
Candidate generation is a crucial module in entity linking.
It plays a key role in multiple NLP tasks that benefit from leveraging knowledge bases.
This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking.
arXiv Detail & Related papers (2022-06-30T09:49:53Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify discourse phenomena and evaluate model performance on them.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Conditioned Text Generation with Transfer for Closed-Domain Dialogue Systems [65.48663492703557]
We show how to optimally train and control the generation of intent-specific sentences using a conditional variational autoencoder.
We introduce a new protocol called query transfer that allows leveraging a large unlabelled dataset.
arXiv Detail & Related papers (2020-11-03T14:06:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.