HLDC: Hindi Legal Documents Corpus
- URL: http://arxiv.org/abs/2204.00806v2
- Date: Fri, 24 May 2024 11:07:12 GMT
- Title: HLDC: Hindi Legal Documents Corpus
- Authors: Arnav Kapoor, Mudit Dhawan, Anmol Goel, T. H. Arjun, Akshala Bhatnagar, Vibhu Agrawal, Amul Agrawal, Arnab Bhattacharya, Ponnurangam Kumaraguru, Ashutosh Modi
- Abstract summary: We introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi.
Documents are cleaned and structured to enable the development of downstream applications.
As a use-case for the corpus, we introduce the task of bail prediction.
- Score: 14.34616914884496
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of the high-quality corpora needed to develop such data-driven systems. The problem gets even more pronounced in the case of low-resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models are indicative of the need for further research in this area. We release the corpus and model implementation code with this paper: https://github.com/Exploration-Lab/HLDC
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- Leveraging open-source models for legal language modeling and analysis: a case study on the Indian constitution [0.0]
This paper presents a novel approach to legal language modeling (LLM) and analysis using open-source models from Hugging Face.
We leverage Hugging Face embeddings via LangChain and Sentence Transformers.
We then demonstrate the application of this model by extracting insights from the official Constitution of India.
arXiv Detail & Related papers (2024-04-10T05:35:47Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Convolutional Neural Networks can achieve binary bail judgement classification [0.5013868868152144]
We deploy a Convolutional Neural Network (CNN) architecture on a corpus of Hindi legal documents.
We perform a bail prediction task with the help of a CNN model and achieve an overall accuracy of 93%.
arXiv Detail & Related papers (2024-01-25T12:31:41Z)
- SLJP: Semantic Extraction based Legal Judgment Prediction [0.0]
Legal Judgment Prediction (LJP) is a judicial assistance system that recommends legal components such as applicable statutes, prison term and penalty term.
Most of the existing Indian models did not adequately concentrate on the semantics embedded in the fact description (FD) that impacts the decision.
The proposed semantic extraction based LJP (SLJP) model provides the advantages of pretrained transformers for complex unstructured legal case document understanding.
arXiv Detail & Related papers (2023-12-13T08:50:02Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification from Indian Legal Documents [10.059041122060686]
Legal Statute Identification (LSI) aims to identify the legal statutes that are relevant to a given description of Facts or evidence of a legal case.
Existing methods only utilize the textual content of Facts and legal articles to guide such a task.
We take the first step towards utilising both the text and the legal citation network for the LSI task.
arXiv Detail & Related papers (2021-12-29T18:39:35Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.