Datasets for Portuguese Legal Semantic Textual Similarity: Comparing
weak supervision and an annotation process approaches
- URL: http://arxiv.org/abs/2306.00007v1
- Date: Mon, 29 May 2023 18:27:10 GMT
- Title: Datasets for Portuguese Legal Semantic Textual Similarity: Comparing
weak supervision and an annotation process approaches
- Authors: Daniel da Silva Junior, Paulo Roberto dos S. Corval, Aline Paes and
Daniel de Oliveira
- Abstract summary: The Brazilian National Council of Justice has established in Resolution 469/2022 formal guidance for document and process digitalization.
This article contributes four datasets from the legal domain: two with documents and metadata but unlabeled, and another two labeled with a heuristic and intended for use in textual semantic similarity tasks.
The analysis of ground truth labels highlights that semantic analysis of domain text can be challenging even for domain experts.
- Score: 1.9244230111838758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Brazilian judiciary has a large workload, resulting in a long time to
finish legal proceedings. The Brazilian National Council of Justice has
established in Resolution 469/2022 formal guidance for document and process
digitalization, opening up the possibility of using automatic techniques to
help with everyday tasks in the legal field, particularly with the large
number of texts produced in routine legal procedures. Notably, Artificial
Intelligence (AI) techniques
allow for processing and extracting useful information from textual data,
potentially speeding up the process. However, datasets from the legal domain
required by several AI techniques are scarce and difficult to obtain as they
need labels from experts. To address this challenge, this article contributes
four datasets from the legal domain: two with documents and metadata but
unlabeled, and another two labeled with a heuristic and intended for use in
textual semantic similarity tasks. Also, to evaluate the effectiveness of the
proposed heuristic label process, this article presents a small ground truth
dataset generated from domain expert annotations. The analysis of ground truth
labels highlights that semantic analysis of domain text can be challenging even
for domain experts. Also, the comparison between ground truth and heuristic
labels shows that heuristic labels are useful.
Related papers
- Judgement Citation Retrieval using Contextual Similarity [0.0]
We propose a methodology that combines natural language processing (NLP) and machine learning techniques to enhance the organization and utilization of legal case descriptions.
Our methodology addresses two primary objectives: unsupervised clustering and supervised citation retrieval.
Our methodology achieved an impressive accuracy rate of 90.9%.
arXiv Detail & Related papers (2024-05-28T04:22:28Z)
- Empowering Prior to Court Legal Analysis: A Transparent and Accessible Dataset for Defensive Statement Classification and Interpretation [5.646219481667151]
This paper introduces a novel dataset tailored for classification of statements made during police interviews, prior to court proceedings.
We introduce a fine-tuned DistilBERT model that achieves state-of-the-art performance in distinguishing truthful from deceptive statements.
We also present an XAI interface that empowers both legal professionals and non-specialists to interact with and benefit from our system.
arXiv Detail & Related papers (2024-05-17T11:22:27Z)
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering
Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z)
- MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness.
Existing SCR datasets only focus on the fact description section when judging the similarity between cases.
We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal element with sentence-level legal element annotations.
arXiv Detail & Related papers (2023-10-24T08:17:11Z)
- LEEC: A Legal Element Extraction Dataset with an Extensive
Domain-Specific Label System [0.4764641468273235]
Legal Element ExtraCtion dataset (LEEC) represents the most extensive and domain-specific legal element extraction dataset for the Chinese legal system.
We introduce a more comprehensive, large-scale criminal element extraction dataset, comprising 15,831 judicial documents and 159 labels.
arXiv Detail & Related papers (2023-10-02T15:16:31Z)
- Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
- FlairNLP at SemEval-2023 Task 6b: Extraction of Legal Named Entities
from Legal Texts using Contextual String Embeddings [0.0]
We employ knowledge extraction techniques, specifically the named entity extraction of legal entities within court case judgements.
We evaluate several state of the art architectures in the realm of sequence labeling using models trained on a curated dataset of legal texts.
A Bi-LSTM model trained on Flair Embeddings achieves the best results.
arXiv Detail & Related papers (2023-06-03T19:38:04Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case
Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
- Effective Approach to Develop a Sentiment Annotator For Legal Domain in
a Low Resource Setting [0.41783829807634776]
Analyzing the sentiments of legal opinions available in Legal Opinion Texts can facilitate several use cases such as legal judgement prediction, contradictory statements identification and party-based sentiment analysis.
The task of developing a legal domain specific sentiment annotator is challenging due to resource constraints such as lack of domain specific labelled data and domain expertise.
In this study, we propose novel techniques that can be used to develop a sentiment annotator for the legal domain while minimizing the need for manual annotations of data.
arXiv Detail & Related papers (2020-10-31T17:12:32Z)
- Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, still remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z)
- Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in the computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.