Related papers: Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej

Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej

URL: http://arxiv.org/abs/2504.03486v1
Date: Fri, 04 Apr 2025 14:41:50 GMT
Title: Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej
Authors: Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Ajay Varghese Thomas, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya,
Abstract summary: We introduce VidhikDastaavej, a novel, anonymized dataset of private legal documents.<n>We develop NyayaShilp, a fine-tuned legal document generation model specifically adapted to Indian legal texts.
Score: 5.790242888372048
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automating legal document drafting can significantly enhance efficiency, reduce manual effort, and streamline legal workflows. While prior research has explored tasks such as judgment prediction and case summarization, the structured generation of private legal documents in the Indian legal domain remains largely unaddressed. To bridge this gap, we introduce VidhikDastaavej, a novel, anonymized dataset of private legal documents, and develop NyayaShilp, a fine-tuned legal document generation model specifically adapted to Indian legal texts. We propose a Model-Agnostic Wrapper (MAW), a two-step framework that first generates structured section titles and then iteratively produces content while leveraging retrieval-based mechanisms to ensure coherence and factual accuracy. We benchmark multiple open-source LLMs, including instruction-tuned and domain-adapted versions, alongside proprietary models for comparison. Our findings indicate that while direct fine-tuning on small datasets does not always yield improvements, our structured wrapper significantly enhances coherence, factual adherence, and overall document quality while mitigating hallucinations. To ensure real-world applicability, we developed a Human-in-the-Loop (HITL) Document Generation System, an interactive user interface that enables users to specify document types, refine section details, and generate structured legal drafts. This tool allows legal professionals and researchers to generate, validate, and refine AI-generated legal documents efficiently. Extensive evaluations, including expert assessments, confirm that our framework achieves high reliability in structured legal drafting. This research establishes a scalable and adaptable foundation for AI-assisted legal drafting in India, offering an effective approach to structured legal document generation.

Related papers

Model Editing for New Document Integration in Generative Information Retrieval [110.90609826290968]
Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs)<n>Existing GR models exhibit poor generalization to newly added documents, often failing to generate the correct docIDs.<n>We propose DOME, a novel method that effectively and efficiently adapts GR models to unseen documents.
arXiv Detail & Related papers (2026-03-03T09:13:38Z)
ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links [57.514511353084565]
We introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links.<n>We apply our framework in two distinct domains -- peer review and news.<n>The resulting novel datasets lay foundation for numerous cross-document tasks like media framing and peer review.
arXiv Detail & Related papers (2025-09-01T11:32:24Z)
Hybrid Topic-Semantic Labeling and Graph Embeddings for Unsupervised Legal Document Clustering [1.6267479602370543]
This paper proposes a hybrid approach for classifying legal texts by combining unsupervised topic and graph embeddings with a supervised model.<n>We employ Top2Vec to learn semantic document embeddings and automatically discover latent topics, and Node2Vec to capture structural relationships via a bipartite graph of legal documents.<n>Our computations on a legal document dataset demonstrate that the combined Top2Vec+Node2Vec approach improves clustering quality over text-only or graph-only embeddings.
arXiv Detail & Related papers (2025-08-31T20:53:59Z)
ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation [56.79698529022327]
Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution.<n>This paper explores the problem of legal claim generation based on the given case's facts.<n>We construct ClaimGen-CN, the first dataset for Chinese legal claim generation task.
arXiv Detail & Related papers (2025-08-24T07:19:25Z)
DREAM: Document Reconstruction via End-to-end Autoregressive Model [53.51754520966657]
We present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM)<n>We establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task.
arXiv Detail & Related papers (2025-07-08T09:24:07Z)
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System [12.256518096712334]
JuDGE (Judgment Document Generation Evaluation) is a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system.<n>We construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents.<n>In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents.
arXiv Detail & Related papers (2025-03-18T13:48:18Z)
CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation [22.98779736851499]
We introduce CaseGen, the benchmark for multi-stage legal case documents generation in the Chinese legal domain.<n>CaseGen is based on 500 real case samples annotated by legal experts and covers seven essential case sections.<n>It supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results.
arXiv Detail & Related papers (2025-02-25T08:03:32Z)
Named entity recognition for Serbian legal documents: Design, methodology and dataset development [0.0]
We present one solution for Named Entity Recognition (NER) in the case of legal documents written in Serbian language.<n>It leverages on the pre-trained bidirectional encoder representations from transformers (BERT), which had been carefully adapted to the specific task of identifying and classifying specific data points from textual content.
arXiv Detail & Related papers (2025-02-14T22:23:39Z)
DocMIA: Document-Level Membership Inference Attacks against DocVQA Models [52.13818827581981]
We introduce two novel membership inference attacks tailored specifically to DocVQA models.<n>Our methods outperform existing state-of-the-art membership inference attacks across a variety of DocVQA models and datasets.
arXiv Detail & Related papers (2025-02-06T00:58:21Z)
Improving Legal Entity Recognition Using a Hybrid Transformer Model and Semantic Filtering Approach [0.0]
This paper proposes a novel hybrid model that enhances the accuracy and precision of Legal-BERT, a transformer model fine-tuned for legal text processing. We evaluate the model on a dataset of 15,000 annotated legal documents, achieving an F1 score of 93.4%, demonstrating significant improvements in precision and recall over previous methods.
arXiv Detail & Related papers (2024-10-11T04:51:28Z)
Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings. First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss. Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision [62.12545440385489]
We introduce Re3, a framework for joint analysis of collaborative document revision. We present Re3-Sci, a large corpus of aligned scientific paper revisions manually labeled according to their action and intent. We use the new data to provide first empirical insights into collaborative document revision in the academic domain.
arXiv Detail & Related papers (2024-05-31T21:19:09Z)
Enhancing Pre-Trained Language Models with Sentence Position Embeddings for Rhetorical Roles Recognition in Legal Opinions [0.16385815610837165]
The size of legal opinions continues to grow, making it increasingly challenging to develop a model that can accurately predict the rhetorical roles of legal opinions. We propose a novel model architecture for automatically predicting rhetorical roles using pre-trained language models (PLMs) enhanced with knowledge of sentence position information. Based on an annotated corpus from the LegalEval@SemEval2023 competition, we demonstrate that our approach requires fewer parameters, resulting in lower computational costs.
arXiv Detail & Related papers (2023-10-08T20:33:55Z)
Continual Learning for Generative Retrieval over Dynamic Corpora [115.79012933205756]
Generative retrieval (GR) directly predicts the identifiers of relevant documents (i.e., docids) based on a parametric model.<n>The ability to incrementally index new documents while preserving the ability to answer queries is vital to applying GR models.<n>We put forward a novel Continual-LEarner for generatiVE Retrieval (CLEVER) model and make two major contributions to continual learning for GR.
arXiv Detail & Related papers (2023-08-29T01:46:06Z)
SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system. Most existing language models have difficulty understanding the long-distance dependencies between different structures. We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
WiCE: Real-World Entailment for Claims in Wikipedia [63.234352061821625]
We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia. In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim. We show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.
arXiv Detail & Related papers (2023-03-02T17:45:32Z)
GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.