Related papers: Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs

URL: http://arxiv.org/abs/2509.17367v1
Date: Mon, 22 Sep 2025 05:34:15 GMT
Title: Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs
Authors: Haoyang Chen, Kumiko Tanaka-Ishii
Abstract summary: We quantify linguistic complexity via Heaps' exponent $\beta$ (vocabulary growth), Taylor's exponent $\alpha$ (word-frequency fluctuation scaling), compression rate $r$ (redundancy), and entropy. We find that legal texts exhibit slower vocabulary growth (lower $\beta$) and higher term consistency (higher $\alpha$) than general texts.
Score: 10.635248457021497
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps' exponent $\beta$ (vocabulary growth), Taylor's exponent $\alpha$ (word-frequency fluctuation scaling), compression rate $r$ (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower $\beta$) and higher term consistency (higher $\alpha$) than general texts. Within the legal domain, statutory codes have the lowest $\beta$ and highest $\alpha$, reflecting strict drafting conventions, while cases and deeds show higher $\beta$ and lower $\alpha$. In contrast, GPT-generated text shows statistics more closely aligned with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities, which current generative models do not fully replicate.
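As a rough illustration of two of the metrics named in the abstract, the Python sketch below estimates Heaps' exponent $\beta$ as the least-squares slope of log vocabulary size against log text length, and a compression rate $r$ as compressed size over raw size. This is not the authors' procedure: the fitting method, the use of zlib as the compressor, and the function names heaps_exponent and compression_rate are all assumptions made for this example. Taylor's exponent and entropy are not covered.

```python
import math
import zlib


def heaps_exponent(tokens):
    """Estimate beta in Heaps' law V(n) ~ K * n**beta.

    Assumption: beta is taken as the ordinary least-squares slope of
    log V(n) against log n over every prefix of the token sequence.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to fit a slope")
    seen = set()
    xs, ys = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        xs.append(math.log(n))          # log of text length so far
        ys.append(math.log(len(seen)))  # log of vocabulary size so far
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def compression_rate(text):
    """Compressed size divided by raw size (lower means more redundant).

    Assumption: zlib at level 9 stands in for whatever compressor is used.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, 9)) / len(raw)
```

With whitespace tokenization, heaps_exponent(text.split()) gives a value near 1 when almost every token is new and a value near 0 when the same few terms recur, which matches the abstract's reading of a lower $\beta$ as slower vocabulary growth.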

Related papers

$β$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment [53.42377319350806]
$β$-CLIP is a multi-granular text-conditioned contrastive learning framework. $β$-CAL addresses the semantic overlap inherent in this hierarchy. $β$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence.
arXiv Detail & Related papers (2025-12-14T13:03:20Z)
Domain Regeneration: How well do LLMs match syntactic properties of text domains? [19.04920427362747]
We prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text -- Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a semantically-controlled setting. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.
arXiv Detail & Related papers (2025-05-12T17:37:17Z)
QUDsim: Quantifying Discourse Similarities in LLM-Generated Text [70.22275200293964]
We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build QUDsim, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs.
arXiv Detail & Related papers (2025-04-12T23:46:09Z)
Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities [13.657259851747126]
Verifying provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not? We model LLM-generated text as a sequential process with complete dependence on history. We then design zero-shot statistical tests to distinguish between text generated by two different known sets of LLMs.
arXiv Detail & Related papers (2025-01-04T23:51:43Z)
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [63.31836335569654]
We investigate the extent to which modern LMs generate $n$-grams from their training data. We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z)
Semi-Supervised Spoken Language Glossification [101.31035869691462]
Spoken language glossification (SLG) aims to translate the spoken language text into the sign language gloss. We present a framework named Semi-Supervised Spoken Language Glossification ($S^3$LG) for SLG.
arXiv Detail & Related papers (2024-06-12T13:05:27Z)
Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection. Our approach achieves better generation quality according to both automatic and human evaluations. Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
Unsupervised Simplification of Legal Texts [0.0]
We introduce an unsupervised simplification method for legal texts (USLT). USLT performs domain-specific text simplification (TS) by replacing complex words and splitting long sentences. We demonstrate that USLT outperforms state-of-the-art domain-general TS methods in text simplicity while keeping the semantics intact.
arXiv Detail & Related papers (2022-09-01T15:58:12Z)
Language modeling via stochastic processes [30.796382023812022]
Modern language models can generate high-quality short texts, but often meander or are incoherent when generating longer texts. Recent work in self-supervised learning suggests that models can learn good latent representations via contrastive learning. We propose one approach for leveraging contrastive representations, which we call Time Control.
arXiv Detail & Related papers (2022-03-21T22:13:53Z)
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.