TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection
- URL: http://arxiv.org/abs/2507.03673v1
- Date: Fri, 04 Jul 2025 15:46:07 GMT
- Title: TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection
- Authors: Xixiang He, Hao Yu, Qiyao Sun, Ao Cheng, Tailai Zhang, Cong Liu, Shuxuan Guo
- Abstract summary: We present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries. We suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding inconsistent criteria seen in singleton-based evaluations.
- Score: 9.020110377060153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data significantly improves IFT in terms of both efficiency and effectiveness. Nevertheless, existing approaches suffer from two limitations: simple heuristics restrict data diversity, while singleton data-quality evaluation applies inconsistent criteria across independent samples. To address these issues, we present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage that denoises the open tags and enables efficient clustering. Additionally, we propose a comparative scoring method that evaluates the relative quality of samples within a cluster, avoiding the inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.
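As a rough illustration of the pipeline the abstract describes (open tagging, tag normalization, clustering, then within-cluster comparative scoring), a minimal Python sketch follows. The `llm_tag` and `llm_compare` helpers, the string-cleanup normalization, and the one-tag-per-sample clustering are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of a TACOS-style pipeline; `llm_tag` and `llm_compare`
# are hypothetical callables standing in for prompted LLM calls.
from collections import defaultdict

def select_ift_subset(samples, llm_tag, llm_compare, per_cluster=1):
    """samples: list of dicts with a 'query' field.
    llm_tag(query) -> list of free-form tag strings (open tagging).
    llm_compare(group) -> one relative-quality score per group member."""
    # 1) Open tagging: an LLM assigns open-domain tags to each query.
    tagged = [(s, llm_tag(s["query"])) for s in samples]

    # 2) Normalization: denoise tags so near-duplicates merge
    #    (simple string cleanup stands in for the paper's stage).
    norm = lambda t: t.lower().strip().replace("-", " ")

    # 3) Cluster by (normalized) first tag for diversity.
    clusters = defaultdict(list)
    for sample, tags in tagged:
        clusters[norm(tags[0]) if tags else "untagged"].append(sample)

    # 4) Comparative scoring: rank samples against the *other members
    #    of the same cluster*, then keep the top few per cluster.
    selected = []
    for members in clusters.values():
        scores = llm_compare(members)  # one joint judgment per cluster
        order = sorted(range(len(members)), key=lambda i: scores[i],
                       reverse=True)
        selected.extend(members[i] for i in order[:per_cluster])
    return selected
```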
Related papers
- Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy [1.8666174950012007]
We show that data selection can be both efficient and universal by using a multi-step pipeline. We use task-based categorization to control the composition of our final data. This integrated strategy enables high-performance fine-tuning with minimal overhead.
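The summary's task-based categorization to control data composition suggests stratified selection. A minimal sketch, assuming hypothetical `get_category` and `score` callables and a proportional quota rule the paper may not use:

```python
# Rough sketch of stratified, task-aware selection; `get_category` and
# `score` are placeholder callables, and the proportional quota rule
# is an assumption for illustration.
from collections import defaultdict

def stratified_select(samples, get_category, score, budget):
    by_task = defaultdict(list)
    for s in samples:
        by_task[get_category(s)].append(s)
    picked = []
    for members in by_task.values():
        # Give each task category a share of the budget proportional
        # to its size, then keep its highest-scoring samples.
        quota = max(1, round(budget * len(members) / len(samples)))
        picked.extend(sorted(members, key=score, reverse=True)[:quota])
    return picked[:budget]
```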
arXiv Detail & Related papers (2025-05-28T09:22:25Z)
- ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment [94.36403843133616]
Using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions. We propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions.
arXiv Detail & Related papers (2025-05-25T17:42:52Z)
- Unbiased Max-Min Embedding Classification for Transductive Few-Shot Learning: Clustering and Classification Are All You Need [83.10178754323955]
Few-shot learning enables models to generalize from only a few labeled examples. We propose the Unbiased Max-Min Embedding Classification (UMMEC) method, which addresses the key challenges in few-shot learning. Our method significantly improves classification performance with minimal labeled data, advancing the state-of-the-art in transductive few-shot learning.
arXiv Detail & Related papers (2025-03-28T07:23:07Z)
- MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning (VIT). We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data selection, leading to the creation of MLLM-Selector.
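A greedy loop is one plausible way to mix per-sample necessity scores with diversity, as the summary describes; the sketch below is an assumption-laden illustration (the necessity scores, unit-normalized embeddings, and the `lam` trade-off are placeholders, not the paper's formulation):

```python
# Hedged sketch: greedily trade off a necessity score against
# redundancy with already-selected samples. Scores, embeddings, and
# the lam weight are placeholders, not the paper's formulation.
import numpy as np

def necessity_diversity_select(embeddings, necessity, k, lam=0.5):
    """embeddings: (n, d) unit-normalized array; necessity: (n,) scores."""
    chosen = []
    for _ in range(min(k, len(necessity))):
        if chosen:
            sim = embeddings @ embeddings[chosen].T  # (n, |chosen|)
            redundancy = sim.max(axis=1)             # nearest selected
        else:
            redundancy = np.zeros(len(necessity))
        gain = np.asarray(necessity, dtype=float) - lam * redundancy
        gain[chosen] = -np.inf                       # never re-pick
        chosen.append(int(gain.argmax()))
    return chosen
```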
arXiv Detail & Related papers (2025-03-26T12:42:37Z)
- Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [41.4789135538612]
This paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. Thanks to the advanced language understanding capabilities of Large Language Models (LLMs), we utilize LLMs to evaluate the value of each option during the selection process.
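A minimal sketch of such a choice-based greedy loop, assuming a hypothetical `llm_choose` judge that picks the most valuable candidate to add on top of the current subset (the prompting details and candidate batching are invented):

```python
# Sketch of a choice-based greedy loop; `llm_choose` is a hypothetical
# judge that returns the index of the candidate contributing the most
# value on top of the current subset.
import random

def add_one_in(pool, llm_choose, budget, options_per_step=4):
    pool, subset = list(pool), []
    while pool and len(subset) < budget:
        options = random.sample(pool, min(options_per_step, len(pool)))
        best = options[llm_choose(subset, options)]
        subset.append(best)
        pool.remove(best)
    return subset
```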
arXiv Detail & Related papers (2025-03-04T07:32:41Z)
- Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs [0.0]
This paper focuses on improving the fine-tuning process by categorizing question-answer pairs into Factual and Conceptual classes. Two distinct Llama-2 models are fine-tuned based on these classifications and evaluated using larger models like GPT-3.5 Turbo and Gemini. Our results indicate that models trained on conceptual datasets outperform those trained on factual datasets.
arXiv Detail & Related papers (2025-03-03T03:26:30Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
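One plausible reading of the dual-model evaluation is a two-judge gate; the sketch below is a guess at the shape of that step (the judge callables and thresholds are placeholders, not the paper's method):

```python
# Speculative sketch of a dual-model gate: keep a generated sample only
# if one judge finds it hard enough and another finds it good enough.
# Judge callables and thresholds are placeholders.
def dual_model_filter(generated, difficulty_judge, quality_judge,
                      min_difficulty=0.4, min_quality=0.7):
    return [s for s in generated
            if difficulty_judge(s) >= min_difficulty
            and quality_judge(s) >= min_quality]
```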
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Improving Model Evaluation using SMART Filtering of Benchmark Datasets [19.731378662304497]
We propose a novel approach to select high-quality subsets of examples from existing benchmark datasets. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other. We demonstrate the effectiveness of SMART on three multiple-choice QA datasets.
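A compact sketch of the three filters named above, with placeholder easiness scores, an external contamination flag, and cosine-similarity deduplication over unit-normalized embeddings (the thresholds are invented, not SMART's):

```python
# Sketch of the three filters named above; `easiness` scores, the
# `contaminated` flags, and both thresholds are placeholders.
def smart_filter(examples, embeddings, easiness, contaminated,
                 easy_cut=0.95, sim_cut=0.9):
    """embeddings: array of unit-normalized vectors, one per example."""
    kept, kept_vecs = [], []
    for i, ex in enumerate(examples):
        if easiness[i] >= easy_cut:        # (i) drop too-easy examples
            continue
        if contaminated[i]:                # (ii) drop contaminated ones
            continue
        v = embeddings[i]
        # (iii) drop near-duplicates of anything already kept.
        if any(float(v @ u) >= sim_cut for u in kept_vecs):
            continue
        kept.append(ex)
        kept_vecs.append(v)
    return kept
```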
arXiv Detail & Related papers (2024-10-26T18:21:44Z)
- Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning [81.83013974171364]
Semi-supervised multi-label learning (SSMLL) is a powerful framework for leveraging unlabeled data to reduce the expensive cost of collecting precise multi-label annotations. Unlike in standard semi-supervised learning, one cannot simply select the most probable label as the pseudo-label in SSMLL, because a single instance can carry multiple semantics. We propose a dual-perspective method to generate high-quality pseudo-labels.
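The metric-adaptive thresholding in the title suggests tuning a per-class threshold against a target metric rather than using a fixed 0.5 cut. A generic sketch of that idea (the grid, the F1 objective, and the held-out-scores setup are assumptions, not the paper's exact procedure):

```python
# Generic sketch of metric-adaptive thresholding: choose a per-class
# cut that maximizes F1 on held-out scores instead of a fixed 0.5.
# The grid and the F1 objective are assumptions, not the paper's.
import numpy as np

def per_class_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """probs, labels: (n, C) arrays of scores and {0,1} annotations."""
    thresholds = []
    for c in range(probs.shape[1]):
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            pred = probs[:, c] >= t
            tp = np.sum(pred & (labels[:, c] == 1))
            fp = np.sum(pred & (labels[:, c] == 0))
            fn = np.sum(~pred & (labels[:, c] == 1))
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return np.array(thresholds)
```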
arXiv Detail & Related papers (2024-07-26T09:33:53Z)
- Empowering HWNs with Efficient Data Labeling: A Clustered Federated Semi-Supervised Learning Approach [2.046985601687158]
Clustered Federated Multitask Learning (CFL) has gained considerable attention as an effective strategy for overcoming statistical challenges.
We introduce a novel framework, Clustered Federated Semi-Supervised Learning (CFSL), designed for more realistic HWN scenarios.
Our results demonstrate that CFSL significantly improves upon key metrics such as testing accuracy, labeling accuracy, and labeling latency under varying proportions of labeled and unlabeled data.
arXiv Detail & Related papers (2024-01-19T11:47:49Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
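The IFD metric compares how well the model generates the answer with and without the instruction. A rough sketch of a perplexity-ratio computation in that spirit, using Hugging Face transformers (the model choice, the bare prompt format, and the exact normalization are simplifications of the paper's definition):

```python
# Rough sketch of an IFD-style perplexity ratio; model and prompt
# format are illustrative simplifications.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def answer_loss(prefix, answer):
    """Mean cross-entropy over the answer tokens only."""
    answer_ids = tok(answer, return_tensors="pt").input_ids
    if prefix:
        prefix_ids = tok(prefix, return_tensors="pt").input_ids
        ids = torch.cat([prefix_ids, answer_ids], dim=1)
        labels = ids.clone()
        labels[:, : prefix_ids.shape[1]] = -100   # ignore prompt tokens
    else:
        ids = answer_ids
        labels = ids.clone()
    return model(ids, labels=labels).loss.item()

def ifd(instruction, answer):
    # Higher ratio = the instruction helps less, i.e. a harder,
    # more informative sample for instruction tuning.
    ppl_cond = math.exp(answer_loss(instruction, answer))
    ppl_free = math.exp(answer_loss("", answer))
    return ppl_cond / ppl_free
```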
arXiv Detail & Related papers (2023-08-23T09:45:29Z)