Related papers: Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity

Related papers

Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM [51.21051698747157]
We propose a self-adaptive gradient-aware data selection approach (GrADS) for supervised fine-tuning of large language models (LLMs)<n>Specifically, we design self-guided criteria that leverage the magnitude and statistical distribution of gradients to prioritize examples that contribute the most to the model's learning process.<n>Through extensive experimentation with various LLMs across diverse domains such as medicine, law, and finance, GrADS has demonstrated significant efficiency and cost-effectiveness.
arXiv Detail & Related papers (2025-11-07T08:34:50Z)
TravelBench : Exploring LLM Performance in Low-Resource Domains [2.2917707112773593]
We curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios.<n>We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks.
arXiv Detail & Related papers (2025-10-03T04:44:34Z)
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study [8.827173113748701]
We study character- and word-level edits of task-specific instructions, which substantially degrade downstream performance.<n>We find that, on average, self-denoising achieves substantially higher performance gains than alternative strategies.
arXiv Detail & Related papers (2025-04-03T16:17:56Z)
An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science [5.064778712920176]
Large Language Models (LLMs) have demonstrated potential for data science tasks via code generation. We propose a novel analyst-inspector framework to automatically evaluate and enforce the of LLM-generated data science.
arXiv Detail & Related papers (2025-02-23T01:15:50Z)
ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning [61.750373974799366]
ThinkBench is an evaluation framework for large language models (LLMs) It unifies the evaluation of reasoning models and non-reasoning models. ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
arXiv Detail & Related papers (2025-02-22T15:41:51Z)
Understanding and Mitigating the Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks [24.706895491806794]
This work presents the first systematic investigation in understanding, analyzing, and mitigating bias inheritance.<n>We analyze how 6 different types of biases manifest at varying bias ratios.<n>We propose three mitigation strategies: token-based, mask-based, and loss-based approaches.
arXiv Detail & Related papers (2025-02-06T15:20:58Z)
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception [20.01853641155509]
Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. We propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent.
arXiv Detail & Related papers (2025-01-31T04:30:42Z)
LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation [37.14344322899091]
Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation. We propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot learning LLM "trees" with confidence-based weighted voting. This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries.
arXiv Detail & Related papers (2024-10-28T20:42:46Z)
LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models [16.67999382790238]
Large language models (LLMs) have revolutionized various domains, yet their utility comes with challenges related to outdated or problematic knowledge embedded during pretraining. This paper addresses the challenge of modifying LLMs to unlearn problematic and outdated information while efficiently integrating new knowledge without retraining from scratch. Using Llama2-7B, we demonstrate that LLM Surgery can achieve significant forgetting on the unlearn set, a 20% increase in accuracy on the update set, and maintain performance on the retain set.
arXiv Detail & Related papers (2024-09-19T19:07:01Z)
Unveiling the Vulnerability of Private Fine-Tuning in Split-Based Frameworks for Large Language Models: A Bidirectionally Enhanced Attack [20.727726850786386]
BiSR is the first data reconstruction attack designed to target both the forward and backward propagation processes of split learning (SL) We propose BiSR, the first data reconstruction attack (DRA) designed to target both the forward and backward propagation processes of SL.
arXiv Detail & Related papers (2024-09-02T06:01:20Z)
Enhancing Temporal Understanding in LLMs for Semi-structured Tables [50.59009084277447]
We conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of large language models (LLMs) Our investigation leads to enhancements in TempTabQA, a dataset specifically designed for temporal temporal question answering. We introduce a novel approach, C.L.E.A.R. to strengthen LLM capabilities in this domain.
arXiv Detail & Related papers (2024-07-22T20:13:10Z)
Learning on Graphs with Large Language Models(LLMs): A Deep Dive into Model Robustness [39.57155321515097]
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks. It remains unclear whether LLMs exhibit robustness in learning on graphs.
arXiv Detail & Related papers (2024-07-16T09:05:31Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs [31.16117964915814]
Machine unlearning, which seeks to erase specific data stored in the pre-trained or fine-tuned models, has emerged as a crucial protective measure for LLMs. To facilitate the development of structural unlearning methods, we propose PISTOL, a pipeline for compiling multi-scenario datasets. We conduct benchmarks with four distinct unlearning methods on both Llama2-7B and Mistral-7B models.
arXiv Detail & Related papers (2024-06-24T17:22:36Z)
Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks. However, their large size makes their inference slow and computationally expensive. We show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
How Good Are LLMs at Out-of-Distribution Detection? [13.35571704613836]
Out-of-distribution (OOD) detection plays a vital role in enhancing the reliability of machine learning (ML) models. This paper embarks on a pioneering empirical investigation of OOD detection in the domain of large language models (LLMs)
arXiv Detail & Related papers (2023-08-20T13:15:18Z)
Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP. We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts. We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.