A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court
- URL: http://arxiv.org/abs/2505.08439v1
- Date: Tue, 13 May 2025 11:06:24 GMT
- Title: A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court
- Authors: Matteo Marulli, Glauco Panattoni, Marco Bertini
- Abstract summary: We develop a document processing pipeline that produces an anonymized dataset optimized for topic modeling. The pipeline integrates document layout analysis (YOLOv8x), optical character recognition, and text anonymization. Compared to OCR-only methods, our dataset improved topic modeling with a diversity score of 0.6198 and a coherence score of 0.6638.
- Score: 5.612141846711729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Topic modeling in Italian legal research is hindered by the lack of public datasets, limiting the analysis of legal themes in Supreme Court judgments. To address this, we developed a document processing pipeline that produces an anonymized dataset optimized for topic modeling. The pipeline integrates document layout analysis (YOLOv8x), optical character recognition, and text anonymization. The DLA module achieved a mAP@50 of 0.964 and a mAP@50-95 of 0.800. The OCR detector reached a mAP@50-95 of 0.9022, and the text recognizer (TrOCR) obtained a character error rate of 0.0047 and a word error rate of 0.0248. Compared to OCR-only methods, our dataset improved topic modeling with a diversity score of 0.6198 and a coherence score of 0.6638. We applied BERTopic to extract topics and used large language models to generate labels and summaries. Outputs were evaluated against domain expert interpretations. Claude Sonnet 3.7 achieved a BERTScore F1 of 0.8119 for labeling and 0.9130 for summarization.
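The character error rate (CER) and word error rate (WER) quoted in the abstract are conventionally defined as the edit distance between the recognized text and the reference transcription, divided by the reference length in characters or words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the sample strings are hypothetical, and the paper may normalize text differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: one misrecognized character in a five-word sentence.
reference = "la corte rigetta il ricorso"
hypothesis = "la corte rigetta il ricorzo"
print(f"CER = {cer(reference, hypothesis):.4f}")
print(f"WER = {wer(reference, hypothesis):.4f}")
```

Because a single wrong character invalidates the whole word, WER is normally higher than CER on the same output, which is consistent with the reported 0.0248 versus 0.0047.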
Related papers
- MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm [60.14048367611333]
MonkeyOCR is a vision-language model for document parsing. It advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm.
arXiv Detail & Related papers (2025-06-05T16:34:57Z)
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition [1.6941039309214678]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text. This technique generates high-precision pseudo-page-to-page labels for diacritic languages. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
- LLMs Can Patch Up Missing Relevance Judgments in Evaluation [56.51461892988846]
We use large language models (LLMs) to automatically label unjudged documents.
We simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgment in TREC DL tracks.
Our method achieves Kendall tau correlations of 0.87 and 0.92 on average for Vicuna-7B and GPT-3.5 Turbo, respectively.
arXiv Detail & Related papers (2024-05-08T00:32:19Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- ASPIRO: Any-shot Structured Parsing-error-Induced ReprOmpting for Consistent Data-to-Text Generation [0.0]
ASPIRO is an approach for structured data verbalisation into short template sentences in zero to few-shot settings.
Unlike previous methods, our approach prompts large language models to directly produce entity-agnostic templates.
arXiv Detail & Related papers (2023-10-27T03:39:51Z)
- Text2Topic: Multi-Label Text Classification System for Efficient Topic Detection in User Generated Content with Zero-Shot Capabilities [2.7311827519141363]
We propose Text to Topic (Text2Topic), which achieves high multi-label classification performance.
Text2Topic supports zero-shot predictions, produces domain-specific text embeddings, and enables production-scale batch-inference.
The model is deployed on a real-world stream processing platform, and it outperforms other models with 92.9% micro mAP.
arXiv Detail & Related papers (2023-10-23T11:33:24Z)
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z)
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from a large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)
- PART: Pre-trained Authorship Representation Transformer [52.623051272843426]
Authors writing documents imprint identifying information within their texts. Previous works use hand-crafted features or classification tasks to train their authorship models. We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Large Scale Legal Text Classification Using Transformer Models [0.0]
We study the performance of transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing and discriminative learning rates.
We quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing, in an ablation study.
arXiv Detail & Related papers (2020-10-24T11:03:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.