Named entity recognition for Serbian legal documents: Design, methodology and dataset development
- URL: http://arxiv.org/abs/2502.10582v1
- Date: Fri, 14 Feb 2025 22:23:39 GMT
- Title: Named entity recognition for Serbian legal documents: Design, methodology and dataset development
- Authors: Vladimir Kalušev, Branko Brkljač,
- Abstract summary: We present one solution for Named Entity Recognition (NER) in the case of legal documents written in Serbian language.
It leverages on the pre-trained bidirectional encoder representations from transformers (BERT), which had been carefully adapted to the specific task of identifying and classifying specific data points from textual content.
- Score: 0.0
- License:
- Abstract: Recent advancements in the field of natural language processing (NLP) and especially large language models (LLMs) and their numerous applications have brought research attention to design of different document processing tools and enhancements in the process of document archiving, search and retrieval. Domain of official, legal documents is especially interesting due to vast amount of data generated on the daily basis, as well as the significant community of interested practitioners (lawyers, law offices, administrative workers, state institutions and citizens). Providing efficient ways for automation of everyday work involving legal documents is therefore expected to have significant impact in different fields. In this work we present one LLM based solution for Named Entity Recognition (NER) in the case of legal documents written in Serbian language. It leverages on the pre-trained bidirectional encoder representations from transformers (BERT), which had been carefully adapted to the specific task of identifying and classifying specific data points from textual content. Besides novel dataset development for Serbian language (involving public court rulings), presented system design and applied methodology, the paper also discusses achieved performance metrics and their implications for objective assessment of the proposed solution. Performed cross-validation tests on the created manually labeled dataset with mean $F_1$ score of 0.96 and additional results on the examples of intentionally modified text inputs confirm applicability of the proposed system design and robustness of the developed NER solution.
Related papers
- Optimizing Multi-Stage Language Models for Effective Text Retrieval [0.0]
We introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets.
Our method leverages advanced language models to achieve state-of-the-art performance.
To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies.
arXiv Detail & Related papers (2024-12-26T16:05:19Z) - Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - Large Language Models for Judicial Entity Extraction: A Comparative Study [0.0]
This research investigates the application of Large Language Models in identifying domain-specific entities within case law documents.
The study evaluates the performance of state-of-the-art Large Language Model architectures, including Large Language Model Meta AI 3, Mistral, and Gemma.
arXiv Detail & Related papers (2024-07-08T09:49:03Z) - Empowering Prior to Court Legal Analysis: A Transparent and Accessible Dataset for Defensive Statement Classification and Interpretation [5.646219481667151]
This paper introduces a novel dataset tailored for classification of statements made during police interviews, prior to court proceedings.
We introduce a fine-tuned DistilBERT model that achieves state-of-the-art performance in distinguishing truthful from deceptive statements.
We also present an XAI interface that empowers both legal professionals and non-specialists to interact with and benefit from our system.
arXiv Detail & Related papers (2024-05-17T11:22:27Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - SAILER: Structure-aware Pre-trained Language Model for Legal Case
Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z) - Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - Layout-Aware Information Extraction for Document-Grounded Dialogue:
Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z) - Documenting Data Production Processes: A Participatory Approach for Data
Work [4.811554861191618]
opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - Classification of Contract-Amendment Relationships [0.0]
We propose an approach based on machine learning (ML) and Natural Language Processing (NLP) to detect the amendment relationship between two documents.
The algorithm takes two PDF documents preprocessed by OCR (Optical Character Recognition) and NER (Named Entity Recognition) as input, and then it builds the features of each document pair.
arXiv Detail & Related papers (2021-06-08T07:57:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.