FASTTRACK: Fast and Accurate Fact Tracing for LLMs
- URL: http://arxiv.org/abs/2404.15157v1
- Date: Mon, 22 Apr 2024 00:07:55 GMT
- Title: FASTTRACK: Fast and Accurate Fact Tracing for LLMs
- Authors: Si Chen, Feiyang Kang, Ning Yu, Ruoxi Jia
- Abstract summary: This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries.
Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency.
- Score: 26.476665624884134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely relevant and those that actually provide supportive evidence for the information sought by the query. This limitation often results in suboptimal effectiveness. Moreover, these approaches necessitate the examination of the similarity of individual training points for each query, imposing significant computational demands and creating a substantial barrier for practical applications. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries and at the same time clusters the training database towards a reduced extent for LLMs to trace facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100% improvement in F1 score over the state-of-the-art methods while being 33× faster than TracIn.
Related papers
- SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training [12.745160748376794]
We propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness.
Central to our approach is the concept of "data commonness", a metric we introduce to quantify the degree of duplication.
Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps.
arXiv Detail & Related papers (2024-07-09T08:26:39Z)
- Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery [6.037276428689637]
This paper introduces DISCOvery Graph (DISCOG), a hybrid approach that combines the strengths of two worlds: a graph-based method for accurate document relevance prediction.
Our approach drastically reduces document review costs by 99.9% compared to manual processes and by 95% compared to LLM-based classification methods.
arXiv Detail & Related papers (2024-05-29T15:08:55Z)
- A Fixed-Point Approach to Unified Prompt-Based Counting [51.20608895374113]
This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for objects indicated by various prompt types, such as box, point, and text.
Our model excels in prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks.
arXiv Detail & Related papers (2024-03-15T12:05:44Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [67.28273187033693]
We show that training a network that directly predicts the desired output, known as amortization, is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
- Evaluation of Test-Time Adaptation Under Computational Time Constraints [80.40939405129102]
Test Time Adaptation (TTA) methods leverage unlabeled data at test time to adapt to distribution shifts.
Current evaluation protocols overlook the effect of this extra cost, affecting their real-world applicability.
We propose a more realistic evaluation protocol for TTA methods, where data is received in an online fashion from a constant-speed data stream.
arXiv Detail & Related papers (2023-04-10T18:01:47Z)
- Tracing Knowledge in Language Models Back to the Training Data [39.02793789536856]
We introduce a new benchmark for fact tracing: tracing language models' assertions back to the training examples that provided evidence for those predictions.
We evaluate influence methods for fact tracing, using well-understood information retrieval metrics.
arXiv Detail & Related papers (2022-05-23T17:34:16Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- An Empirical Comparison of Instance Attribution Methods for NLP [62.63504976810927]
We evaluate the degree to which different instance attribution methods agree with respect to the importance of training samples.
We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods.
arXiv Detail & Related papers (2021-04-09T01:03:17Z)