Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods
- URL: http://arxiv.org/abs/2512.05681v1
- Date: Fri, 05 Dec 2025 12:54:26 GMT
- Title: Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods
- Authors: Tereza Novotna, Jakub Harasta
- Abstract summary: A general-purpose embedder (OpenAI) outperforms a domain-specific BERT trained from scratch on ~30,000 decisions. Our framework is robust enough to be used for evaluation under a noisy gold dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieving case law is a time-consuming task predominantly carried out by querying databases. We provide a comparison of two models in three different settings for Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI), (ii) a domain-specific BERT trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation including IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance via paired bootstrap, and an nDCG diagnosis supported by qualitative analysis. Despite modest absolute nDCG (expected under noisy labels), the general OpenAI embedder decisively outperforms the domain pre-trained BERT in both settings at @10/@20/@100 across both thresholds; the differences are statistically significant. Diagnostics attribute the low absolute scores to label drift and strong ideals rather than a lack of utility. Additionally, our framework is robust enough to be used for evaluation under a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.
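The evaluation pipeline the abstract describes (IDF-weighted keyword overlap as graded relevance, binarization at a threshold, nDCG@k, and paired-bootstrap significance) can be sketched as follows. This is a minimal illustrative version, not the authors' implementation; all function names and the normalization of the overlap score are assumptions:

```python
import math
import random
from collections import Counter

def idf_weights(docs_keywords):
    """IDF weight per keyword over a corpus of per-document keyword sets."""
    n = len(docs_keywords)
    df = Counter(k for kws in docs_keywords for k in set(kws))
    return {k: math.log(n / df[k]) for k in df}

def graded_relevance(query_kws, doc_kws, idf):
    """IDF-weighted keyword overlap, normalized by the query's total IDF mass."""
    inter = set(query_kws) & set(doc_kws)
    denom = sum(idf.get(k, 0.0) for k in set(query_kws))
    return sum(idf.get(k, 0.0) for k in inter) / denom if denom else 0.0

def binarize(gains, threshold):
    """Turn graded relevance into binary labels (e.g. threshold 0.20 or 0.28)."""
    return [1.0 if g >= threshold else 0.0 for g in gains]

def ndcg_at_k(gains, k):
    """nDCG@k over a ranked list of graded relevance scores."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

def paired_bootstrap(scores_a, scores_b, iters=10_000, seed=0):
    """Fraction of bootstrap resamples of the per-query score differences
    where system A does not beat system B (a one-sided p-value estimate)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    worse = sum(
        1 for _ in range(iters)
        if sum(rng.choice(diffs) for _ in range(n)) <= 0
    )
    return worse / iters
```

In this setup each retrieval system yields one nDCG@k per query; the paired bootstrap then resamples the query-level differences to test significance, mirroring the comparison at @10/@20/@100.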
Related papers
- DWBench: Holistic Evaluation of Watermark for Dataset Copyright Auditing [43.881484429055654]
The dataset watermark technique holds promise for auditing and verifying dataset usage. We develop DWBench, a unified benchmark and open-source toolkit for systematically evaluating image dataset watermark techniques. We present the results of two new metrics: sample significance for fine-grained watermark distinguishability and verification success rate for dataset-level auditing.
arXiv Detail & Related papers (2026-02-14T01:09:19Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose SEOE, a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection. Our framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains. We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
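A semantic F1-score of this kind can be sketched by greedily matching predicted labels to gold labels under a caller-supplied equivalence judge. Here a plain predicate stands in for the LLM agent; the `semantic_f1` name and greedy one-to-one matching are illustrative assumptions, not SEOE's actual procedure:

```python
def semantic_f1(pred, gold, similar):
    """F1 where a predicted label counts as a true positive if the judge
    `similar(p, g)` deems it equivalent to some not-yet-matched gold label."""
    remaining = list(gold)
    tp = 0
    for p in pred:
        for g in remaining:
            if similar(p, g):
                remaining.remove(g)  # each gold label matches at most once
                tp += 1
                break
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy judge: exact match plus a small synonym table (stands in for an LLM).
SYNONYMS = {("acquisition", "merger")}
def toy_judge(p, g):
    return p == g or (p, g) in SYNONYMS
```

Swapping `toy_judge` for an LLM call that consults fine-grained label definitions recovers the framework's intended behavior.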
arXiv Detail & Related papers (2025-03-05T09:37:05Z) - A robust three-way classifier with shadowed granular-balls based on justifiable granularity [53.39844791923145]
We construct a robust three-way classifier with shadowed GBs for uncertain data.
Our model demonstrates robustness in managing uncertain data and effectively mitigates classification risks.
arXiv Detail & Related papers (2024-07-03T08:54:45Z) - Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels [6.872072177648135]
We propose a novel localization algorithm that adapts well-established ground truth estimation methods.
Our algorithm also shows superior performance during training on the TexBiG dataset.
arXiv Detail & Related papers (2023-09-18T13:08:44Z) - Guiding Pseudo-labels with Uncertainty Estimation for Test-Time Adaptation [27.233704767025174]
Test-Time Adaptation (TTA) is a specific case of Unsupervised Domain Adaptation (UDA) where a model is adapted to a target domain without access to source data.
We propose a novel approach for the TTA setting based on a loss reweighting strategy that brings robustness against the noise that inevitably affects the pseudo-labels.
arXiv Detail & Related papers (2023-03-07T10:04:55Z) - Lifting Weak Supervision To Structured Prediction [12.219011764895853]
Weak supervision (WS) is a rich set of techniques that produce pseudolabels by aggregating easily obtained but potentially noisy label estimates.
We introduce techniques new to weak supervision based on pseudo-Euclidean embeddings and tensor decompositions.
Several of our results, which can be viewed as robustness guarantees in structured prediction with noisy labels, may be of independent interest.
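The pseudolabel aggregation that weak supervision builds on can be illustrated with a minimal majority-vote baseline. This is the generic starting point the first sentence describes, not the paper's pseudo-Euclidean embedding or tensor-decomposition method; the `aggregate` name and the abstain-as-`None` convention are assumptions:

```python
from collections import Counter

def aggregate(votes):
    """Majority-vote pseudolabel from noisy labeling-function outputs.
    `None` means the labeling function abstained and is ignored."""
    counts = Counter(v for v in votes if v is not None)
    return counts.most_common(1)[0][0] if counts else None
```

More sophisticated aggregators weight each labeling function by an estimated accuracy instead of counting votes equally.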
arXiv Detail & Related papers (2022-11-24T02:02:58Z) - S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
In the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space.
Our method significantly surpasses previous methods on CIFAR-10/CIFAR-100 with artificial noise and on real-world noisy datasets such as WebVision and ANIMAL-10N.
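The neighborhood label-consistency idea behind this sample selection can be sketched as follows. This is a hypothetical minimal version (pure-Python cosine k-NN; the `select_clean` name, `k`, and the agreement threshold are illustrative choices), not the S3 implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na and nb else 0.0

def select_clean(features, labels, k=2, agree=0.5):
    """Keep sample i only if at least `agree` of its k nearest neighbours in
    feature space share its annotated label -- samples whose labels disagree
    with their neighbourhood are flagged as likely noisy."""
    keep = []
    for i, (fi, li) in enumerate(zip(features, labels)):
        sims = sorted(
            ((cosine(fi, fj), lj)
             for j, (fj, lj) in enumerate(zip(features, labels)) if j != i),
            reverse=True,
        )
        frac = sum(1 for _, lj in sims[:k] if lj == li) / k
        keep.append(frac >= agree)
    return keep
```

In a full pipeline the retained samples would feed the supervised loss while the rejected ones are relabeled or dropped.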
arXiv Detail & Related papers (2021-11-22T15:49:20Z) - A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels [23.658440146240025]
We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR).
We prove that bounds on specific metrics used to evaluate clustering algorithms can be computed without reference labels.
We also introduce a procedure that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality.
arXiv Detail & Related papers (2021-09-23T03:42:01Z) - BiSTF: Bilateral-Branch Self-Training Framework for Semi-Supervised Large-scale Fine-Grained Recognition [28.06659482245647]
Semi-supervised Fine-Grained Recognition is a challenging task due to data imbalance, high inter-class similarity and domain mismatch.
We propose the Bilateral-Branch Self-Training Framework (BiSTF) to improve existing semi-supervised learning on class-imbalanced and domain-shifted fine-grained data.
We show that BiSTF outperforms existing state-of-the-art SSL methods on the Semi-iNat dataset.
arXiv Detail & Related papers (2021-07-14T15:28:54Z) - Approximating Instance-Dependent Noise via Instance-Confidence Embedding [87.65718705642819]
Label noise in multiclass classification is a major obstacle to the deployment of learning systems.
We investigate the instance-dependent noise (IDN) model and propose an efficient approximation of IDN to capture the instance-specific label corruption.
arXiv Detail & Related papers (2021-03-25T02:33:30Z) - Joint Visual and Temporal Consistency for Unsupervised Domain Adaptive Person Re-Identification [64.37745443119942]
This paper jointly enforces visual and temporal consistency in the combination of a local one-hot classification and a global multi-class classification.
Experimental results on three large-scale ReID datasets demonstrate the superiority of the proposed method in both fully unsupervised and unsupervised domain adaptive ReID tasks.
arXiv Detail & Related papers (2020-07-21T14:31:27Z)