From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning
- URL: http://arxiv.org/abs/2601.13995v1
- Date: Tue, 20 Jan 2026 14:06:51 GMT
- Title: From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning
- Authors: Zihan Niu, Wenping Hu, Junmin Chen, Xiyue Wang, Tong Xu, Ruiming Tang,
- Abstract summary: Tree-aware Aligned Global Sampling (TAGS) is a unified framework that leverages a knowledge tree built from fine-grained tags. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL-divergence for specific domains.
- Score: 31.186300383302708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up hierarchical clustering. By grounding data instances onto this tree, a tree-aware metric then quantifies data quality and diversity, facilitating effective sampling. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL-divergence for specific domains. Extensive experiments demonstrate that TAGS significantly outperforms state-of-the-art baselines. Notably, it surpasses the full-dataset model by +5.84% using only 5% of the data, while our aligned sampling strategy further boosts average performance by +4.24%.
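The leaf-level alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each instance is grounded to exactly one leaf of the knowledge tree, uses a hypothetical greedy loop, and ignores the tree-level information-gain term entirely. The names `kl_divergence`, `leaf_distribution`, and `greedy_aligned_sample` are made up for illustration.

```python
import math
from collections import Counter


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over discrete distributions given as dicts.

    eps is additive smoothing so that leaves missing from q
    produce a large but finite penalty (an assumption of this sketch).
    """
    return sum(
        pk * math.log((pk + eps) / (q.get(k, 0.0) + eps))
        for k, pk in p.items()
        if pk > 0.0
    )


def leaf_distribution(leaves):
    """Empirical distribution of leaf labels in a selection."""
    counts = Counter(leaves)
    total = sum(counts.values())
    return {leaf: c / total for leaf, c in counts.items()}


def greedy_aligned_sample(pool, target, budget):
    """Greedily pick instances so the selected leaf distribution
    stays close (in KL) to the target distribution.

    pool:   list of (instance_id, leaf) pairs (one leaf per instance,
            a simplification made for this sketch)
    target: dict mapping leaf -> desired probability
    """
    remaining = list(pool)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        best_i, best_kl = 0, math.inf
        for i, (_, leaf) in enumerate(remaining):
            trial = leaf_distribution([l for _, l in chosen] + [leaf])
            kl = kl_divergence(target, trial)
            if kl < best_kl:
                best_i, best_kl = i, kl
        chosen.append(remaining.pop(best_i))
    return chosen


pool = [("a", "math"), ("b", "math"), ("c", "code"),
        ("d", "code"), ("e", "chat"), ("f", "chat")]
target = {"math": 0.5, "code": 0.5}
picked = greedy_aligned_sample(pool, target, budget=4)
print(Counter(leaf for _, leaf in picked))
```

With a target split evenly between two leaves, the loop avoids the off-target `chat` leaf and balances the other two. A fuller version would presumably combine this alignment term with a quality or diversity score, as the abstract describes.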
Related papers
- Learning Order Forest for Qualitative-Attribute Data Clustering [52.612779710298526]
This paper discovers a tree-like distance structure to flexibly represent the local order relationship among intra-attribute qualitative values. A joint learning mechanism is proposed to iteratively obtain more appropriate tree structures and clusters. Experiments demonstrate that the joint learning adapts the forest to the clustering task to yield accurate results.
arXiv Detail & Related papers (2026-03-03T07:49:50Z)
- Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models [64.58262227709842]
ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) is presented. It builds semantic-aware representations that complement the metric space of categorical data for accurate clustering. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z)
- Online hierarchical partitioning of the output space in extreme multi-label data stream [2.474908349649168]
This work introduces iHOMER, an online multi-label learning framework that partitions the label space into disjoint, correlated clusters without relying on predefined hierarchies. Experiments across 23 real-world datasets show iHOMER outperforms 5 state-of-the-art global baselines, such as MLHAT, MLHT of Pruned Sets and iSOUPT, by 23%, and 12 local baselines, such as binary relevance transformations of kNN, EFDT, ARF, and ADWIN bagging/boosting ensembles, by 32%, establishing its robustness for online multi-label classification.
arXiv Detail & Related papers (2025-07-28T14:47:13Z)
- Divide-Then-Rule: A Cluster-Driven Hierarchical Interpolator for Attribute-Missing Graphs [51.13363550716544]
Deep graph clustering is an unsupervised task aimed at partitioning nodes with incomplete attributes into distinct clusters. Existing imputation methods for attribute-missing graphs often fail to account for the varying amounts of information available across node neighborhoods. We propose Divide-Then-Rule Graph Completion (DTRGC) to address this issue.
arXiv Detail & Related papers (2025-07-12T03:33:19Z)
- Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
- Incorporating Attributes and Multi-Scale Structures for Heterogeneous Graph Contrastive Learning [8.889313669713918]
We propose a novel contrastive learning framework for heterogeneous graphs (ASHGCL). ASHGCL incorporates three distinct views, each focusing on node attributes, high-order and low-order structural information, respectively. We introduce an attribute-enhanced positive sample selection strategy that combines both structural information and attribute information.
arXiv Detail & Related papers (2025-03-18T05:15:21Z)
- scTree: Discovering Cellular Hierarchies in the Presence of Batch Effects in scRNA-seq Data [12.01555110624794]
scTree corrects for batch effects while simultaneously learning a tree-structured data representation.
We show empirically on seven datasets that scTree discovers the underlying clusters of the data.
arXiv Detail & Related papers (2024-06-27T16:16:55Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
arXiv Detail & Related papers (2023-05-24T11:05:12Z)
- Interactive Steering of Hierarchical Clustering [30.371250297444703]
We present an interactive steering method to visually supervise constrained hierarchical clustering by utilizing both public knowledge (e.g., Wikipedia) and private knowledge from users.
The novelty of our approach includes 1) automatically constructing constraints for hierarchical clustering using knowledge (knowledge-driven) and intrinsic data distribution (data-driven)
To clearly convey the hierarchical clustering results, an uncertainty-aware tree visualization has been developed to enable users to quickly locate the most uncertain sub-hierarchies.
arXiv Detail & Related papers (2020-09-21T05:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.