ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople
- URL: http://arxiv.org/abs/2602.00881v1
- Date: Sat, 31 Jan 2026 20:05:48 GMT
- Title: ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople
- Authors: Shounak Paul, Raghav Dogra, Pawan Goyal, Saptarshi Ghosh
- Abstract summary: Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law.
- Score: 7.998373645118032
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. This task has traditionally been modeled using facts from court judgments as input queries, due to their abundance. However, in practical settings, the input queries are likely to be informal and asked by laypersons, or non-professionals. While a few laypeople LSI datasets exist, there has been little research to explore the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. The corpus also contains court case judgements to enable researchers to effectively compare court and laypeople data for LSI. We conducted extensive experiments on our corpus, including benchmarking over the laypeople dataset using zero-shot and few-shot inference, retrieval-augmented generation, and supervised fine-tuning. We observe that models trained purely on court judgements are ineffective when tested on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conducted fine-grained analyses of our results in terms of categories of queries and frequency of statutes.
Related papers
- Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning [0.5308136763388956]
We use India's public legal examinations as a transparent proxy. Our benchmark assembles objective screens from top national and state exams. We also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court's Advocate-on-Record exam.
arXiv Detail & Related papers (2025-10-19T10:04:29Z)
- LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation [9.894351313663874]
Legal Case Retrieval (LCR) is a fundamental task for legal professionals. Existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches.
arXiv Detail & Related papers (2025-05-28T09:02:41Z)
- LEXam: Benchmarking Legal Reasoning on 340 Law Exams [76.3521146499006]
We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
arXiv Detail & Related papers (2025-05-19T08:48:12Z)
- AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction [56.797874973414636]
AnnoCaseLaw is a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Our dataset lays the groundwork for more human-aligned, explainable Legal Judgment Prediction models. Results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult.
arXiv Detail & Related papers (2025-02-28T19:14:48Z)
- Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z)
- Query-driven Relevant Paragraph Extraction from Legal Judgments [1.2562034805037443]
Legal professionals often struggle to navigate lengthy legal judgements and pinpoint information that directly addresses their queries.
This paper focuses on the task of extracting relevant paragraphs from legal judgements based on a query.
We construct a specialized dataset for this task from the European Court of Human Rights (ECtHR) using the case law guides.
arXiv Detail & Related papers (2024-03-31T08:03:39Z)
- DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment [55.91429725404988]
We introduce DELTA, a discriminative model designed for legal case retrieval.
We leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability.
Our approach can outperform existing state-of-the-art methods in legal case retrieval.
arXiv Detail & Related papers (2024-03-27T10:40:14Z)
- Low-Resource Court Judgment Summarization for Common Law Systems [32.13166048504629]
We present CLSum, the first dataset for summarizing multi-jurisdictional common law court judgment documents.
This is the first court judgment summarization work adopting large language models (LLMs) in data augmentation, summary generation, and evaluation.
arXiv Detail & Related papers (2024-03-07T12:47:42Z)
- MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness.
Existing SCR datasets only focus on the fact description section when judging the similarity between cases.
We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal elements, with sentence-level legal element annotations.
arXiv Detail & Related papers (2023-10-24T08:17:11Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
- JUSTICE: A Benchmark Dataset for Supreme Court's Judgment Prediction [0.0]
We aim to create a high-quality dataset of SCOTUS court cases so that they may be readily used in natural language processing (NLP) research and other data-driven applications.
By using advanced NLP algorithms to analyze previous court cases, the trained models are able to predict and classify a court's judgment.
arXiv Detail & Related papers (2021-12-06T23:19:08Z)