Related papers: Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining

Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining

URL: http://arxiv.org/abs/2507.14619v1
Date: Sat, 19 Jul 2025 13:30:14 GMT
Title: Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining
Authors: Van-Hoang Le, Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen,
Abstract summary: This paper presents a two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy.<n>Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias.<n>The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.
Score: 4.233176571117095
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.

Related papers

Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
Learning Binarized Representations with Pseudo-positive Sample Enhancement for Efficient Graph Collaborative Filtering [35.82405808653398]
We study the problem of graph representation binarization for efficient collaborative filtering.<n>Our findings indicate that explicitly mitigating information loss at various stages of embedding binarization has a significant positive impact on performance.<n>Compared to its predecessor BiGeaR, BiGeaR++ introduces a fine-grained inference distillation mechanism and an effective embedding sample synthesis approach.
arXiv Detail & Related papers (2025-06-03T11:11:43Z)
KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG [63.82127103851471]
Retrieval-Augmented Generation (RAG) enables large language models to access broader knowledge sources.<n>We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance.<n>We present KARE-RAG, which improves knowledge utilization through three key innovations.
arXiv Detail & Related papers (2025-06-03T06:31:17Z)
Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets [14.494301139974455]
We propose a novel paradigm for re-ranking called online relevance estimation.<n>Online relevance estimation continuously updates relevance estimates for a query throughout the ranking process.<n>We validate our approach on TREC benchmarks under two scenarios: hybrid retrieval and adaptive retrieval.
arXiv Detail & Related papers (2025-04-12T22:05:50Z)
Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval [49.669503570350166]
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task.<n>Existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively.<n>We propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking.
arXiv Detail & Related papers (2025-04-07T15:27:37Z)
On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low.<n>Inference-time alignment relies on three components: sampling, evaluation, and feedback.<n>We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
Enhancing Retrieval Performance: An Ensemble Approach For Hard Negative Mining [0.0]
This study focuses on explaining the crucial role of hard negatives in the training process of cross-encoder models. We have developed a robust hard negative mining technique for efficient training of cross-encoder re-rank models on an enterprise dataset.
arXiv Detail & Related papers (2024-10-18T05:23:39Z)
A Reproducibility Study of PLAID [25.86500025007641]
We compare PLAID with an important baseline missing from the paper: re-ranking a lexical system. We find that applying ColBERTv2 as a re-ranker atop an initial pool of BM25 results provides better efficiency-effectiveness trade-offs in low-latency settings. We find that recently proposed modifications to re-ranking that pull in the neighbors of top-scoring documents overcome this limitation.
arXiv Detail & Related papers (2024-04-23T12:46:53Z)
Building an Efficient and Effective Retrieval-based Dialogue System via Mutual Learning [27.04857039060308]
We propose to combine the best of both worlds to build a retrieval system. We employ a fast bi-encoder to replace the traditional feature-based pre-retrieval model. We train the pre-retrieval model and the re-ranking model at the same time via mutual learning.
arXiv Detail & Related papers (2021-10-01T01:32:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.