LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution
- URL: http://arxiv.org/abs/2601.21929v1
- Date: Thu, 29 Jan 2026 16:18:34 GMT
- Title: LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution
- Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
- Abstract summary: Training data attribution identifies which training examples most influenced a model's prediction. LoRIF exploits the low-rank structure of gradients to address the two scalability bottlenecks of gradient-based attribution. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$\times$ storage reduction and query-time speedup.
- Score: 62.830878652285406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training data attribution (TDA) identifies which training examples most influenced a model's prediction. The best-performing TDA methods exploit gradients to define an influence function. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: \emph{(i)} storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and \emph{(ii)} forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality -- creating a quality--scalability tradeoff. We introduce \textbf{LoRIF (Low-Rank Influence Functions)}, which exploits the low-rank structure of gradients to address both bottlenecks. First, we store rank-$c$ factors of the projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.
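As a rough illustration of the two ideas described in the abstract, the minimal NumPy sketch below stores a projected per-layer gradient as rank-$c$ factors and computes gradient inner products directly from those factors, then applies the damped inverse Hessian approximation through the Woodbury identity so that only a $D \times r$ factor is ever materialized. The shapes (square $d \times d$ projected gradients with $D = d^2$), the damped form $\lambda I + U\,\mathrm{diag}(s)\,U^\top$, and all function names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (our assumptions, not the paper's code) of the two ideas:
# (i) rank-c storage of projected per-example gradients, and
# (ii) Woodbury-based application of the damped inverse Hessian approximation.
import numpy as np


def rank_c_factors(G, c):
    """Factor a projected per-layer gradient G (d x d) into rank-c pieces.

    Storing (A, B) costs O(c * d) = O(c * sqrt(D)) per layer per example,
    versus O(d^2) = O(D) for the full matrix."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :c] * s[:c], Vt[:c, :].T          # G ~= A @ B.T


def factored_inner(A1, B1, A2, B2):
    """Frobenius inner product <A1 B1^T, A2 B2^T> without rebuilding d x d matrices."""
    return np.sum((A1.T @ A2) * (B1.T @ B2))


def woodbury_apply(U, s, lam, v):
    """Apply (lam*I + U diag(s) U^T)^{-1} to v, with U a D x r orthonormal basis.

    Only the D x r factor is kept, i.e. O(D r) memory instead of the
    O(D^2) dense inverse."""
    middle = np.diag(1.0 / s) + np.eye(len(s)) / lam          # r x r system
    return v / lam - (U @ np.linalg.solve(middle, U.T @ v)) / lam**2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, c, D, r, lam = 64, 4, 256, 16, 0.1

    # (i) low-rank storage of two projected per-example gradients
    G1 = rng.standard_normal((d, c)) @ rng.standard_normal((c, d))
    G2 = rng.standard_normal((d, c)) @ rng.standard_normal((c, d))
    A1, B1 = rank_c_factors(G1, c)
    A2, B2 = rank_c_factors(G2, c)
    print(np.allclose(factored_inner(A1, B1, A2, B2), np.sum(G1 * G2)))  # True

    # (ii) Woodbury application of the damped inverse Hessian approximation
    U, _ = np.linalg.qr(rng.standard_normal((D, r)))  # rank-r basis (truncated-SVD stand-in)
    s = np.abs(rng.standard_normal(r)) + 1.0          # top-r eigenvalues
    H = lam * np.eye(D) + U @ np.diag(s) @ U.T        # dense reference, for the check only
    v = rng.standard_normal(D)
    print(np.allclose(woodbury_apply(U, s, lam, v), np.linalg.solve(H, v)))  # True
```

With pieces like these, an attribution query would reduce to scores of the form $g_q^\top (\hat{H} + \lambda I)^{-1} g_i$ evaluated from the stored factors, which is where the claimed $O(c\sqrt{D})$ per-example I/O and $O(Dr)$ memory figures would come from.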
Related papers
- FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training [51.39495282347475]
We introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam.
arXiv Detail & Related papers (2024-11-12T14:41:07Z) - Differential Private Stochastic Optimization with Heavy-tailed Data: Towards Optimal Rates [15.27596975662702]
We explore algorithms achieving optimal rates of DP optimization with heavy-tailed gradients.
Our results match the minimax lower bound in \cite{kamath2022}, indicating that the theoretical limit of convex optimization under DP is achievable.
arXiv Detail & Related papers (2024-08-19T11:07:05Z) - Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs [24.305423716384272]
We study the impact of the batch size on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD).
We show that performing gradient updates with large batches minimizes the training time without changing the total sample complexity.
We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs).
arXiv Detail & Related papers (2024-06-04T09:44:49Z) - Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients [0.196629787330046]
We show that incorporating partial second-order information of the objective function can dramatically improve the robustness of variance-reduced gradient methods to the mini-batch size.
We demonstrate this phenomenon on a prototypical Newton algorithm ($\texttt{Mb-SVRN}$).
arXiv Detail & Related papers (2024-04-23T05:45:52Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experiments show that training exhibits a fairly special behavior in that clipping mitigates the order of complexity.
arXiv Detail & Related papers (2023-03-02T00:57:38Z) - Coordinating Distributed Example Orders for Provably Accelerated Training [39.05759866984658]
We propose Coordinated Distributed GraB (CD-GraB) to translate the benefits of permutation-based example ordering to distributed settings.
With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
arXiv Detail & Related papers (2023-02-02T03:15:29Z) - Asynchronous Training Schemes in Distributed Learning with Time Delay [17.259708772713164]
In the context of distributed deep learning, the issue of stale weights or gradients could result in poor algorithmic performance.
In this paper, we propose a different approach to tackle the issue of stale weights or gradients.
A practical variant of PC-ASGD is also proposed, which adopts a condition to help determine the tradeoff parameter.
arXiv Detail & Related papers (2022-08-28T07:14:59Z) - Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to apply differential privacy to the BERT model, achieving an average accuracy of $83.9\%$ on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)