Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi
- URL: http://arxiv.org/abs/2602.07382v1
- Date: Sat, 07 Feb 2026 05:55:18 GMT
- Title: Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi
- Authors: Debtanu Datta, Rajdeep Mukherjee, Adrijit Goswami, Saptarshi Ghosh,
- Abstract summary: We aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi.<n>We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts.<n>Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization.
- Score: 6.770978279356662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Summarizing Indian legal court judgments is a complex task not only due to the intricate language and unstructured nature of the legal texts, but also since a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated through domain experts, demonstrating the effectiveness of our approaches.
Related papers
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain.<n>LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning.<n>We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z) - ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation [56.79698529022327]
Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution.<n>This paper explores the problem of legal claim generation based on the given case's facts.<n>We construct ClaimGen-CN, the first dataset for Chinese legal claim generation task.
arXiv Detail & Related papers (2025-08-24T07:19:25Z) - NyayaAnumana & INLegalLlama: The Largest Indian Legal Judgment Prediction Dataset and Specialized Language Model for Enhanced Decision Analysis [5.790242888372048]
This paper introduces NyayaAnumana, the largest and most diverse corpus of Indian legal cases compiled for legal judgment prediction (LJP)<n>NyayaAnumana includes a wide range of cases from the Supreme Court, High Courts, Tribunal Courts, District Courts, and Daily Orders.<n>In addition to the dataset, we present INLegalLlama, a domain-specific generative large language model (LLM) tailored to the intricacies of the Indian legal system.
arXiv Detail & Related papers (2024-12-11T13:50:17Z) - IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning [16.12863746776168]
Legal systems worldwide are inundated with exponential growth in cases and documents.
There is an imminent need to develop NLP and ML techniques for automatically processing and understanding legal documents.
This paper proposes IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning.
arXiv Detail & Related papers (2024-07-07T14:55:04Z) - Legal Judgment Reimagined: PredEx and the Rise of Intelligent AI Interpretation in Indian Courts [6.339932924789635]
textbfPrediction with textbfExplanation (textttPredEx) is the largest expert-annotated dataset for legal judgment prediction and explanation in the Indian context.
This corpus significantly enhances the training and evaluation of AI models in legal analysis.
arXiv Detail & Related papers (2024-06-06T14:57:48Z) - DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment [55.91429725404988]
We introduce DELTA, a discriminative model designed for legal case retrieval.
We leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability.
Our approach can outperform existing state-of-the-art methods in legal case retrieval.
arXiv Detail & Related papers (2024-03-27T10:40:14Z) - MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of
Indian Legal Case Judgments [6.522489660886997]
It is crucial to summarize the legal documents in Indian languages to ensure equitable access to justice.
This study presents a pioneering effort toward cross-lingual summarization of English legal documents into Hindi.
We construct the first high-quality legal corpus comprising of 3,122 case judgments from prominent Indian courts in English, along with their summaries in both English and Hindi.
arXiv Detail & Related papers (2023-10-28T05:51:57Z) - Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model
Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are the previous legal cases with similar facts, which are the basis for the judgment of the subsequent case in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z) - SAILER: Structure-aware Pre-trained Language Model for Legal Case
Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z) - Indian Legal Text Summarization: A Text Normalisation-based Approach [0.0]
There are more than 4 crore cases outstanding in the Indian court system.
Many state-theart models for text summarization have emerged as machine learning has progressed.
domain-independent models don't do well with legal texts.
Authors have proposed a methodology for normalising legal texts in the Indian context.
arXiv Detail & Related papers (2022-06-13T15:16:50Z) - Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release the Longformer-based pre-trained language model, named as Lawformer, for Chinese legal long documents understanding.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.