Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach
- URL: http://arxiv.org/abs/2502.10453v1
- Date: Wed, 12 Feb 2025 01:28:40 GMT
- Title: Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach
- Authors: Régnier Avice, Bernhard Haslhofer, Zhidong Li, Jianlong Zhou
- Abstract summary: We propose a novel computational method based on Large Language Models (LLMs) to link attribution tags with well-defined knowledge graph concepts. Our approach outperforms baseline methods by up to 37.4% in F1-score across three publicly available attribution tag datasets. Our method not only enhances attribution tag quality but also serves as a blueprint for fostering more reliable forensic evidence.
- Score: 4.348296766881638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attribution tags form the foundation of modern cryptoasset forensics. However, inconsistent or incorrect tags can mislead investigations and even result in false accusations. To address this issue, we propose a novel computational method based on Large Language Models (LLMs) to link attribution tags with well-defined knowledge graph concepts. We implemented this method in an end-to-end pipeline and conducted experiments showing that our approach outperforms baseline methods by up to 37.4% in F1-score across three publicly available attribution tag datasets. By integrating concept filtering and blocking procedures, we generate candidate sets containing five knowledge graph entities, achieving a recall of 93% without the need for labeled data. Additionally, we demonstrate that local LLM models can achieve F1-scores of 90%, comparable to remote models which achieve 94%. We also analyze the cost-performance trade-offs of various LLMs and prompt templates, showing that selecting the most cost-effective configuration can reduce costs by 90%, with only a 1% decrease in performance. Our method not only enhances attribution tag quality but also serves as a blueprint for fostering more reliable forensic evidence.
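As a rough illustration of the pipeline the abstract describes (blocking to a small candidate set of knowledge graph entities, then LLM-based selection), the sketch below uses a simple string-similarity blocker; the entity names, the similarity function, and the injected `choose` callable standing in for the LLM are all illustrative assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher

def block_candidates(tag, kg_entities, k=5):
    """Blocking: rank knowledge graph entities by string similarity
    to the attribution tag and keep only the top-k as candidates."""
    scored = sorted(
        kg_entities,
        key=lambda e: SequenceMatcher(None, tag.lower(), e.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]

def link_tag(tag, kg_entities, choose):
    """End-to-end linking: generate a small candidate set, then ask an
    LLM (here an injected `choose` callable) to pick the best match."""
    candidates = block_candidates(tag, kg_entities)
    return choose(tag, candidates)

# Toy run with a trivial stand-in "LLM" that picks the top-ranked candidate.
entities = ["Binance", "Bitfinex", "Coinbase", "Kraken", "Mt. Gox", "Poloniex"]
print(link_tag("binance.com wallet", entities, lambda tag, cands: cands[0]))
```

Keeping the candidate set small (five entities in the paper) is what makes the LLM step cheap: the model only has to discriminate among a handful of plausible matches rather than the whole knowledge graph.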
Related papers
- Evaluating the Use of LLMs for Documentation to Code Traceability [3.076436880934678]
Large Language Models can establish trace links between various software documentation and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets.
arXiv Detail & Related papers (2025-06-19T16:18:53Z)
- Document Attribution: Examining Citation Relationships using Large Language Models [62.46146670035751]
We propose a zero-shot approach that frames attribution as a straightforward textual entailment task. We also explore the role of the attention mechanism in enhancing the attribution process.
arXiv Detail & Related papers (2025-05-09T04:40:11Z)
- Leveraging Large Language Models for Effective Label-free Node Classification in Text-Attributed Graphs [10.538099379851198]
Locle is an active self-training framework that does Label-free node Classification with LLMs cost-Effectively.
It iteratively identifies small sets of "critical" samples using GNNs and extracts informative pseudo-labels for them with both LLMs and GNNs.
It significantly outperforms state-of-the-art methods under the same query budget to LLMs in terms of label-free node classification.
arXiv Detail & Related papers (2024-12-16T17:04:40Z)
- Training Task Experts through Retrieval Based Distillation [55.46054242512261]
We present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data.
Our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
arXiv Detail & Related papers (2024-07-07T18:27:59Z)
- Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery [6.037276428689637]
This paper introduces DISCOvery Graph (DISCOG), a hybrid approach that combines a graph-based method for accurate document relevance prediction with LLM-based retrieval and reasoning.
Our approach drastically reduces document review costs by 99.9% compared to manual processes and by 95% compared to LLM-based classification methods.
arXiv Detail & Related papers (2024-05-29T15:08:55Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
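The DPO update described above can be illustrated with the per-pair loss on a single (chosen, rejected) step pair; the log-probability values below are made up for the example, and a real implementation would compute them from the policy and reference models over batches.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
    where pi_* and ref_* are log-probs under the policy and reference."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A step pair where the policy already prefers the chosen step more
# strongly than the reference model does, so the loss falls below ln(2).
loss = dpo_loss(logp_chosen=-2.0, logp_rejected=-5.0,
                ref_chosen=-3.0, ref_rejected=-4.0, beta=0.1)
print(round(loss, 4))
```

At a margin of zero (policy and reference agree), the loss sits at ln(2); it decreases as the policy learns to prefer the MCTS-chosen step over the rejected one.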
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Pearls from Pebbles: Improved Confidence Functions for Auto-labeling [51.44986105969375]
Threshold-based auto-labeling (TBAL) works by finding a threshold on a model's confidence scores above which it can accurately label unlabeled data points.
We propose a framework for studying the optimal TBAL confidence function.
We develop a new post-hoc method specifically designed to maximize performance in TBAL systems.
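The vanilla TBAL mechanism this paper builds on can be sketched as a threshold search on a small labeled calibration set; the toy scores below and the exact accuracy constraint are illustrative, not taken from the paper, which focuses on designing better confidence functions for this loop.

```python
def tbal_threshold(confidences, correct, min_accuracy=0.95):
    """Threshold-based auto-labeling: return the lowest confidence
    threshold at which points scoring above it are labeled accurately
    enough, maximizing coverage subject to the accuracy constraint."""
    for t in sorted(set(confidences)):
        kept = [c for conf, c in zip(confidences, correct) if conf >= t]
        if kept and sum(kept) / len(kept) >= min_accuracy:
            return t  # lowest qualifying threshold = highest coverage
    return None  # no threshold meets the accuracy target

# Toy calibration set: confidence scores with 0/1 correctness flags.
confs   = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
correct = [0,    0,    1,    1,    1,    1]
print(tbal_threshold(confs, correct, min_accuracy=0.95))
```

A better-calibrated confidence function lets this search settle on a lower threshold, auto-labeling more points while still meeting the accuracy target, which is the paper's motivation.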
arXiv Detail & Related papers (2024-04-24T20:22:48Z)
- EntGPT: Linking Generative Large Language Models with Knowledge Bases [8.557683104631883]
We introduce EntGPT, employing advanced prompt engineering to enhance EL tasks. Our three-step hard-prompting method (EntGPT-P) significantly boosts the micro-F1 score by up to 36% over vanilla prompts. Our instruction tuning method (EntGPT-I) improves micro-F1 scores by 2.1% on average in supervised EL tasks.
arXiv Detail & Related papers (2024-02-09T19:16:27Z)
- DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
- CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding [86.79903269137971]
Unsupervised visual grounding has been developed to locate regions using pseudo-labels.
We propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels.
Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-05-15T14:42:02Z)
- A new weakly supervised approach for ALS point cloud semantic segmentation [1.4620086904601473]
We propose a deep-learning based weakly supervised framework for semantic segmentation of ALS point clouds.
We exploit potential information from unlabeled data subject to incomplete and sparse labels.
Our method achieves an overall accuracy of 83.0% and an average F1 score of 70.0%, improvements of 6.9% and 12.8%, respectively.
arXiv Detail & Related papers (2021-10-04T14:00:23Z)
- To be Critical: Self-Calibrated Weakly Supervised Learning for Salient Object Detection [95.21700830273221]
Weakly-supervised salient object detection (WSOD) aims to develop saliency models using image-level annotations.
We propose a self-calibrated training strategy by explicitly establishing a mutual calibration loop between pseudo labels and network predictions.
We prove that even a much smaller dataset with well-matched annotations can facilitate models to achieve better performance as well as generalizability.
arXiv Detail & Related papers (2021-09-04T02:45:22Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
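The core selection step in uncertainty-aware self-training, keeping only pseudo-labels the model is confident about, can be sketched with predictive entropy as the uncertainty measure; the entropy cutoff and toy predictions below are illustrative assumptions, and the paper's actual estimates come from the underlying neural network (e.g., via stochastic dropout), not a single softmax.

```python
import math

def predictive_entropy(probs):
    """Entropy of a predicted class distribution; low entropy = confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, max_entropy=0.5):
    """Uncertainty-aware self-training: keep only unlabeled examples whose
    predicted distribution is low-entropy enough to trust, and pseudo-label
    each with its argmax class."""
    selected = []
    for idx, probs in predictions:
        if predictive_entropy(probs) <= max_entropy:
            label = max(range(len(probs)), key=probs.__getitem__)
            selected.append((idx, label))
    return selected

preds = [(0, [0.95, 0.05]),   # confident -> kept as pseudo-label 0
         (1, [0.55, 0.45]),   # uncertain -> dropped this round
         (2, [0.10, 0.90])]   # confident -> kept as pseudo-label 1
print(select_pseudo_labels(preds))
```

Each self-training round then retrains on the labeled set plus the selected pseudo-labels, which is why filtering out uncertain predictions matters: a confidently wrong pseudo-label compounds across rounds.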
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.