LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset
- URL: http://arxiv.org/abs/2310.17609v1
- Date: Thu, 26 Oct 2023 17:32:55 GMT
- Title: LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset
- Authors: Haitao Li, Yunqiu Shao, Yueyue Wu, Qingyao Ai, Yixiao Ma, Yiqun Liu
- Abstract summary: We introduce LeCaRDv2, a large-scale Legal Case Retrieval dataset (version 2).
It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents.
We enrich the existing relevance criteria by considering three key aspects: characterization, penalty, and procedure.
It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law.
- Score: 20.315416393247247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an important component of intelligent legal systems, legal case retrieval
plays a critical role in ensuring judicial justice and fairness. However, the
development of legal case retrieval technologies in the Chinese legal system is
restricted by three problems in existing datasets: limited data size, narrow
definitions of legal relevance, and naive candidate pooling strategies used in
data sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale
Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192
candidates extracted from 4.3 million criminal case documents. To the best of
our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval
datasets, providing extensive coverage of criminal charges. Additionally, we
enrich the existing relevance criteria by considering three key aspects:
characterization, penalty, and procedure. These comprehensive criteria enrich the
dataset and may provide a more holistic perspective. Furthermore, we propose a
two-level candidate set pooling strategy that effectively identifies potential
candidates for each query case. It's important to note that all cases in the
dataset have been annotated by multiple legal experts specializing in criminal
law. Their expertise ensures the accuracy and reliability of the annotations.
We evaluate several state-of-the-art retrieval models on LeCaRDv2,
demonstrating that there is still significant room for improvement in legal
case retrieval. The details of LeCaRDv2 can be found at the anonymous website
https://github.com/anonymous1113243/LeCaRDv2.
Related papers
- Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z)
- Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation [22.85652668826498]
This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs).
By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes.
arXiv Detail & Related papers (2024-06-28T08:59:45Z)
- DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment [55.91429725404988]
We introduce DELTA, a discriminative model designed for legal case retrieval.
We leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability.
Our approach can outperform existing state-of-the-art methods in legal case retrieval.
arXiv Detail & Related papers (2024-03-27T10:40:14Z)
- MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness.
Existing SCR datasets only focus on the fact description section when judging the similarity between cases.
We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal elements, with sentence-level legal element annotations.
arXiv Detail & Related papers (2023-10-24T08:17:11Z)
- An Intent Taxonomy of Legal Case Retrieval [43.22489520922202]
Legal case retrieval is a special Information Retrieval (IR) task focusing on legal case documents.
We present a novel hierarchical intent taxonomy of legal case retrieval.
We reveal significant differences in user behavior and satisfaction under different search intents in legal case retrieval.
arXiv Detail & Related papers (2023-07-25T07:27:32Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
- Legal Element-oriented Modeling with Multi-view Contrastive Learning for Legal Case Retrieval [3.909749182759558]
We propose an interaction-focused network for legal case retrieval with a multi-view contrastive learning objective.
Case-view contrastive learning minimizes the hidden space distance between relevant legal case representations.
We employ a legal element knowledge-aware indicator to detect legal elements of cases.
arXiv Detail & Related papers (2022-10-11T06:47:23Z)
- LEVEN: A Large-Scale Chinese Legal Event Detection Dataset [82.44096140591675]
We present LEVEN, a large-scale Chinese LEgal eVENt detection dataset, with 8,116 legal documents and 150,977 human-annotated event mentions in 108 event types.
LEVEN is the largest Legal Event Detection dataset and has dozens of times the data scale of others, which shall significantly promote the training and evaluation of LED methods.
arXiv Detail & Related papers (2022-03-16T11:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.