LLM-based Embedders for Prior Case Retrieval
- URL: http://arxiv.org/abs/2507.18455v1
- Date: Thu, 24 Jul 2025 14:36:10 GMT
- Title: LLM-based Embedders for Prior Case Retrieval
- Authors: Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov
- Abstract summary: Prior case retrieval (PCR) is an information retrieval task that aims to automatically identify the most relevant court cases. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges. Due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively.
- Score: 9.770692788739868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In common law systems, legal professionals such as lawyers and judges rely on precedents to build their arguments. As the volume of cases has grown massively over time, effectively retrieving prior cases has become essential. Prior case retrieval (PCR) is an information retrieval (IR) task that aims to automatically identify the most relevant court cases for a specific query from a large pool of potential candidates. While IR methods have seen several paradigm shifts over the last few years, the vast majority of PCR methods continue to rely on traditional IR methods, such as BM25. The state-of-the-art deep learning IR methods have not been successful in PCR due to two key challenges: i. Lengthy legal text limitation; when using the powerful BERT-based transformer models, there is a limit on input text length, which inevitably requires shortening the input via truncation or division, with a loss of legal context information. ii. Lack of legal training data; due to data privacy concerns, available PCR datasets are often limited in size, making it difficult to train deep learning-based models effectively. In this research, we address these challenges by leveraging LLM-based text embedders in PCR. LLM-based embedders support longer input lengths, and since we use them in an unsupervised manner, they do not require training data, addressing both challenges simultaneously. In this paper, we evaluate state-of-the-art LLM-based text embedders on four PCR benchmark datasets and show that they outperform BM25 and supervised transformer-based models.
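The unsupervised setup the abstract describes reduces to a simple pipeline: embed the query case and every candidate with a long-context LLM-based embedder, then rank candidates by cosine similarity, with no task-specific training. Below is a minimal sketch assuming the sentence-transformers library and one illustrative embedder; the paper's exact models, datasets, and preprocessing may differ.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Illustrative LLM-based embedder; the paper's exact choices may differ.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

def rank_candidates(query_case: str, candidate_cases: list[str]) -> list[int]:
    """Rank candidate judgments by cosine similarity to the query case."""
    # Zero-shot: no PCR-specific fine-tuning or labeled data is used.
    query_vec = model.encode([query_case], normalize_embeddings=True)
    cand_vecs = model.encode(candidate_cases, normalize_embeddings=True)
    scores = (cand_vecs @ query_vec.T).ravel()  # cosine similarity on unit vectors
    return np.argsort(-scores).tolist()         # most relevant candidates first
```

Because the ranking relies only on a pretrained embedder, no labeled PCR pairs are needed, which is how this setup sidesteps the small-dataset problem the abstract raises.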
Related papers
- Segment First, Retrieve Better: Realistic Legal Search via Rhetorical Role-Based Queries [3.552993426200889]
TraceRetriever mirrors real-world legal search by operating with limited case information. Our pipeline integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion. Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier trained on Indian judgments.
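Reciprocal Rank Fusion (RRF) is a standard recipe for merging rankings from heterogeneous retrievers such as the three listed above. The sketch below is a generic implementation using the conventional smoothing constant k = 60, not TraceRetriever's actual code:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings: each document scores 1/(k + rank) per list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # high ranks anywhere add most
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25, dense-index, and cross-encoder rankings (toy IDs).
fused = reciprocal_rank_fusion([
    ["case_12", "case_07", "case_31"],  # BM25 ranking
    ["case_07", "case_44", "case_12"],  # vector database ranking
    ["case_07", "case_12", "case_44"],  # cross-encoder re-ranking
])
```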
arXiv Detail & Related papers (2025-08-01T14:49:33Z)
- Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning [71.3533541927459]
We propose a novel data selection paradigm termed Reasoning Activation Potential (RAP). RAP identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning. Our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
arXiv Detail & Related papers (2025-06-05T08:40:24Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
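One plausible way to scale synthetic query-candidate pairs, sketched under the assumption of an instruction-tuned LLM accessed via the Hugging Face transformers pipeline (the model name and prompt are illustrative, not the paper's recipe): have the LLM compress each lengthy case into a short fact-description query and pair it with the case.

```python
from transformers import pipeline

# Small instruction-tuned model as a stand-in generator (assumption).
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def synthesize_pair(case_text: str) -> tuple[str, str]:
    """Generate a (synthetic query, candidate case) training pair."""
    prompt = (
        "Summarize the following court case as a one-sentence fact "
        f"description a lawyer might search for:\n{case_text[:2000]}\nSummary:"
    )
    out = generator(prompt, max_new_tokens=64, return_full_text=False)
    return out[0]["generated_text"].strip(), case_text
```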
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - Tractable Offline Learning of Regular Decision Processes [50.11277112628193]
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs).
In RDPs, the unknown dependency of future observations and rewards on past interactions can be captured experimentally.
Many algorithms first reconstruct this unknown dependency using automata learning techniques.
arXiv Detail & Related papers (2024-09-04T14:26:58Z) - LawLLM: Law Large Language Model for the US Legal System [43.13850456765944]
We introduce the Law Large Language Model (LawLLM), a multi-task model specifically designed for the US legal domain.
LawLLM excels at Similar Case Retrieval (SCR), Precedent Case Recommendation (PCR), and Legal Judgment Prediction (LJP).
We propose customized data preprocessing techniques for each task that transform raw legal data into a trainable format.
arXiv Detail & Related papers (2024-07-27T21:51:30Z) - Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation [22.124234811959532]
Large language models (LLMs) exhibit significant drawbacks when processing long contexts.
We propose a novel RAG prompting methodology, which can be directly applied to pre-trained transformer-based LLMs.
We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks.
arXiv Detail & Related papers (2024-04-10T11:03:17Z) - ECtHR-PCR: A Dataset for Precedent Understanding and Prior Case Retrieval in the European Court of Human Rights [1.3723120574076126]
We develop a prior case retrieval dataset based on judgements from the European Court of Human Rights (ECtHR).
We benchmark different lexical and dense retrieval approaches with various negative sampling strategies.
We find that difficulty-based negative sampling strategies were not effective for the PCR task.
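A common realization of difficulty-based negative sampling, shown only for illustration since the summary does not spell out the strategies benchmarked: mine the top-ranked non-relevant cases from a lexical ranker (here the rank_bm25 package) as hard negatives for training a dense retriever.

```python
from rank_bm25 import BM25Okapi

def hard_negatives(query: str, corpus: list[str],
                   relevant_ids: set[int], n: int = 5) -> list[int]:
    """Return indices of the n hardest negatives for a query."""
    bm25 = BM25Okapi([doc.split() for doc in corpus])  # whitespace tokenization
    scores = bm25.get_scores(query.split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    # Highest-scoring *non-relevant* documents are the hardest negatives.
    return [i for i in ranked if i not in relevant_ids][:n]
```

The finding above suggests that, for PCR, such hard negatives did not help, possibly because lexically similar judgments are often genuinely relevant precedents.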
arXiv Detail & Related papers (2024-03-31T08:06:54Z) - Enhancing Legal Document Retrieval: A Multi-Phase Approach with Large Language Models [7.299483088092052]
This research focuses on maximizing the potential of prompting by placing it as the final phase of the retrieval system.
Experiments on the COLIEE 2023 dataset demonstrate that integrating prompting techniques on LLMs into the retrieval system significantly improves retrieval accuracy.
However, error analysis reveals several existing issues in the retrieval system that still need resolution.
arXiv Detail & Related papers (2024-03-26T20:25:53Z) - Reliable, Adaptable, and Attributable Language Models with Retrieval [144.26890121729514]
Parametric language models (LMs) are trained on vast amounts of web data.
They face practical challenges such as hallucinations, difficulty in adapting to new data distributions, and a lack of verifiability.
We advocate for retrieval-augmented LMs to replace parametric LMs as the next generation of LMs.
arXiv Detail & Related papers (2024-03-05T18:22:33Z) - LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning [61.7853049843921]
Chain-of-thought (CoT) prompting is a popular in-context learning approach for large language models (LLMs). This paper introduces a new approach named Latent Reasoning Skills (LaRS) that employs unsupervised learning to create a latent space representation of rationales.
arXiv Detail & Related papers (2023-12-07T20:36:10Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
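The interplay InteR describes can be sketched as a single round of LLM-side query expansion followed by retrieval; the model, prompt, and BM25 retriever below are illustrative stand-ins rather than the paper's implementation.

```python
from transformers import pipeline
from rank_bm25 import BM25Okapi

# Stand-in generator for the LLM side of the loop (assumption).
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def expand_and_retrieve(query: str, corpus: list[str], top_k: int = 5) -> list[str]:
    """Expand the query with LLM-generated knowledge, then retrieve with BM25."""
    knowledge = llm(f"Write a short background passage about: {query}",
                    max_new_tokens=96, return_full_text=False)[0]["generated_text"]
    expanded = f"{query} {knowledge}"  # query enriched with generated knowledge
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(expanded.split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:top_k]]
```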