Related papers: Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection

Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection

URL: http://arxiv.org/abs/2510.22531v1
Date: Sun, 26 Oct 2025 04:46:06 GMT
Title: Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection
Authors: Noshitha Padma Pratyusha Juttu, Sahithi Singireddy, Sravani Gona, Sujal Timilsina,
Abstract summary: Large Language Models (LLMs) have transformed text understanding, yet their adaptation to specialized legal domains remains constrained by the cost of full fine-tuning.<n>This study provides a systematic evaluation of fine tuning, parameter efficient adaptation (LoRA, QLoRA), and zero-shot prompting strategies for unfair clause detection in Terms of Service (ToS) documents.<n> Experiments on the CLAUDETTE-ToS benchmark and the Multilingual Scraper Corpus show that full fine-tuning achieves the strongest precision recall balance, while LoRA-based models provide competitive recall with up to 3x lower memory cost.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have transformed text understanding, yet their adaptation to specialized legal domains remains constrained by the cost of full fine-tuning. This study provides a systematic evaluation of fine tuning, parameter efficient adaptation (LoRA, QLoRA), and zero-shot prompting strategies for unfair clause detection in Terms of Service (ToS) documents, a key application in legal NLP. We finetune BERT and DistilBERT, apply 4-bit Low-Rank Adaptation (LoRA) to models such as TinyLlama, LLaMA 3B/7B, and SaulLM, and evaluate GPT-4o and O-versions in zero-shot settings. Experiments on the CLAUDETTE-ToS benchmark and the Multilingual Scraper Corpus show that full fine-tuning achieves the strongest precision recall balance, while LoRA-based models provide competitive recall with up to 3x lower memory cost. These findings highlight practical design trade-offs for efficient and domain-adapted LLMs, contributing open baselines for fine-tuning research in legal text processing.

Related papers

Parameter Efficient Fine Tuning Llama 3.1 for Answering Arabic Legal Questions: A Case Study on Jordanian Laws [0.5708902722746041]
This study uses Jordanian law as a case study to explore the fine-tuning of the Llama-3.1 large language model for Arabic question-answering.<n>Two versions of the model were fine-tuned using parameter-efficient fine-tuning (PEFT) with LoRA adapters and 4-bit quantized models.<n>Results demonstrated improved legal reasoning and accuracy while achieving resource efficiency.
arXiv Detail & Related papers (2026-01-24T08:17:26Z)
Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning [0.0]
Low-Rank Adaptation (LoRA) fine-tuned Large Language Models (LLMs) can approximate the performance of fully fine-tuned models in generating human-interpretable decisions and explanations for malware classification.<n>LoRA offers a practical balance of interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality.
arXiv Detail & Related papers (2025-11-24T19:37:13Z)
All for law and law for all: Adaptive RAG Pipeline for Legal Research [0.8819595592190884]
Retrieval-Augmented Generation (RAG) has transformed how we approach text generation tasks.<n>This work introduces a novel end-to-end RAG pipeline that improves upon previous baselines.
arXiv Detail & Related papers (2025-08-18T17:14:03Z)
ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts [21.217188970086344]
The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored.<n>This paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts.
arXiv Detail & Related papers (2025-08-05T04:53:05Z)
Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs [32.45604456988931]
This study establishes baseline comparisons for Automated Fact-Checking (AFC) using Large Language Models (LLMs)<n>We evaluate Llama-3 models of varying sizes on 17,856 claims collected from PolitiFact (2007-2024) using evidence retrieved via restricted web searches.<n>Our results show that larger LLMs consistently outperform smaller LLMs in classification accuracy and justification quality without fine-tuning.
arXiv Detail & Related papers (2025-02-13T02:51:17Z)
LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaption (LoRA) is a widely used parameter-efficient finetuning method for LLM that reduces memory requirements.<n>This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z)
SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance. We introduce SFR-RAG, a small LLM that is instruction-textual with an emphasis on context-grounded generation and hallucination. We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.<n>First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics.<n>Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition [27.515920408920216]
Low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) is a resource-efficient modeling approach for memory-constrained hardware. In this study, we explore how to enhance model performance by introducing various LoRA training strategies. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine against input perturbations.
arXiv Detail & Related papers (2024-01-19T01:30:16Z)
SCALE: Synergized Collaboration of Asymmetric Language Translation Engines [105.8983433641208]
We introduce a collaborative framework that connects compact Specialized Translation Models (STMs) and general-purpose Large Language Models (LLMs) as one unified translation engine. By introducing translation from STM into the triplet in-context demonstrations, SCALE unlocks refinement and pivoting ability of LLM. Our experiments show that SCALE significantly outperforms both few-shot LLMs (GPT-4) and specialized models (NLLB) in challenging low-resource settings.
arXiv Detail & Related papers (2023-09-29T08:46:38Z)
Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification [2.410463233396231]
Small language models (SLMs) offer significant customizability, adaptability, and cost-effectiveness for domain-specific tasks. In few-shot settings when prompt-based model fine-tuning is possible, T5-base, a typical SLM with 220M parameters, achieve approximately 75% accuracy with limited labeled data. In zero-shot settings with a fixed model, we underscore a pivotal observation that, although the GPT-3.5-turbo equipped with around 154B parameters garners an accuracy of 55.16%, the power of well designed prompts becomes evident.
arXiv Detail & Related papers (2023-09-26T09:24:46Z)
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text. Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.