Exploring Large Language Models and Hierarchical Frameworks for
Classification of Large Unstructured Legal Documents
- URL: http://arxiv.org/abs/2403.06872v1
- Date: Mon, 11 Mar 2024 16:24:08 GMT
- Title: Exploring Large Language Models and Hierarchical Frameworks for
Classification of Large Unstructured Legal Documents
- Authors: Nishchal Prasad, Mohand Boughanem, Taoufiq Dkaki
- Abstract summary: We explore the classification of large legal documents and their lack of structural information with a deep-learning-based hierarchical framework.
Specifically, we divide a document into parts to extract their embeddings from the last four layers of a custom fine-tuned Large Language Model.
Our approach achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods.
- Score: 0.6349503549199403
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Legal judgment prediction suffers from the problem of long case documents
exceeding tens of thousands of words, in general, and having a non-uniform
structure. Predicting judgments from such documents becomes a challenging task,
more so on documents with no structural annotation. We explore the
classification of these large legal documents and their lack of structural
information with a deep-learning-based hierarchical framework for judgment
prediction, which we call MESc ("Multi-stage Encoder-based Supervised
with-clustering"). Specifically, we divide a document into parts to extract their
embeddings from the last four layers of a custom fine-tuned Large Language
Model, and try to approximate their structure through unsupervised clustering.
We then use these in another set of transformer encoder layers to learn the
inter-chunk representations. We analyze the adaptability of Large Language
Models (LLMs) with multi-billion parameters (GPT-Neo, and GPT-J) with the
hierarchical framework of MESc and compare them with their standalone
performance on legal texts. We also study their intra-domain (legal) transfer
learning capability and the impact of combining embeddings from their last
layers in MESc. We test these methods and their effectiveness with extensive
experiments and ablation studies on legal documents from India, the European
Union, and the United States with the ILDC dataset and a subset of the LexGLUE
dataset. Our approach achieves a minimum total performance gain of
approximately 2 points over previous state-of-the-art methods.
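
The first stages described in the abstract (splitting a document into parts, combining embeddings from the last four layers, and clustering the parts) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the chunk size, how the layer embeddings are combined, or which clustering algorithm is used, so the fixed-size chunks, the concatenation, and the small k-means loop below are all assumptions. `toy_layer_vectors` is a hypothetical stand-in for the fine-tuned language model, and the later transformer encoder stage is not shown.

```python
import random
from typing import List

Vector = List[float]


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive, non-overlapping chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def toy_layer_vectors(chunk: List[str], n_layers: int = 6, dim: int = 3) -> List[Vector]:
    """Hypothetical stand-in for an encoder: one pseudo-embedding per layer.

    A real pipeline would return the model's hidden states for the chunk.
    """
    rng = random.Random(" ".join(chunk))  # deterministic for a given chunk
    return [[rng.random() for _ in range(dim)] for _ in range(n_layers)]


def combine_last_layers(layer_vectors: List[Vector], n_last: int = 4) -> Vector:
    """Concatenate the vectors from the last n_last layers (an assumption)."""
    combined: Vector = []
    for vec in layer_vectors[-n_last:]:
        combined.extend(vec)
    return combined


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points: List[Vector], k: int, n_iter: int = 10) -> List[int]:
    """Tiny k-means; assumes len(points) >= k and uses the first k points as seeds."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every chunk embedding.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy document: 80 tokens split into chunks of 12 tokens (last chunk is shorter).
tokens = ("the court finds that the appeal is dismissed " * 10).split()
chunks = split_into_chunks(tokens, chunk_size=12)
chunk_embeddings = [combine_last_layers(toy_layer_vectors(c)) for c in chunks]
cluster_ids = kmeans_labels(chunk_embeddings, k=2)
```

In this sketch each chunk ends up with one combined vector and one cluster id; how those are used together downstream is left open here, since the abstract only says that they feed a further set of transformer encoder layers.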
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- A Multi-Modal Multilingual Benchmark for Document Image Classification [21.7518357653137]
We introduce two newly curated multilingual datasets WIKI-DOC and MULTIEUR-DOCLEX.
We study popular visually-rich document understanding (Document AI) models in a previously untested setting in document image classification.
Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z)
- A Hierarchical Neural Framework for Classification and its Explanation in Large Unstructured Legal Documents [0.5812284760539713]
We define this problem as "scarce annotated legal documents".
We propose a deep-learning-based classification framework which we call MESc.
We also propose an explanation extraction algorithm named ORSE.
arXiv Detail & Related papers (2023-09-19T12:18:28Z)
- A Machine Learning Approach to Classifying Construction Cost Documents into the International Construction Measurement Standard [0.0]
We introduce the first automated models for classifying natural language descriptions provided in cost documents called "Bills of Quantities".
We learn from a dataset of more than 50 thousand descriptions of items retrieved from 24 large infrastructure construction projects across the United Kingdom.
arXiv Detail & Related papers (2022-10-24T11:35:53Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Large-Scale Multi-Document Summarization with Information Extraction and Compression [31.601707033466766]
We develop an abstractive summarization framework independent of labeled data for multiple heterogeneous documents.
Our framework processes documents telling different stories instead of documents on the same topic.
Our experiments demonstrate that our framework outperforms current state-of-the-art methods in this more generic setting.
arXiv Detail & Related papers (2022-05-01T19:49:15Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make a clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)