TBDFiltering: Sample-Efficient Tree-Based Data Filtering
- URL: http://arxiv.org/abs/2601.22016v1
- Date: Thu, 29 Jan 2026 17:22:06 GMT
- Title: TBDFiltering: Sample-Efficient Tree-Based Data Filtering
- Authors: Robert Istvan Busa-Fekete, Julian Zimmert, Anne Xiangyi Zheng, Claudio Gentile, Andras Gyorgy
- Abstract summary: The quality of machine learning models depends heavily on their training data. We propose a text-embedding-based hierarchical clustering approach that adaptively selects the documents to be evaluated. Our method can correctly predict the quality of each document after querying a small number of documents.
- Score: 19.186418132888182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quality of machine learning models depends heavily on their training data. Selecting high-quality, diverse training sets for large language models (LLMs) is a difficult task, due to the lack of cheap and reliable quality metrics. While querying existing LLMs for document quality is common, this is not scalable to the large number (billions) of documents used in training. Instead, practitioners often use classifiers trained on sparse quality signals. In this paper, we propose a text-embedding-based hierarchical clustering approach that adaptively selects the documents to be evaluated by the LLM to estimate cluster quality. We prove that our method is query efficient: under the assumption that the hierarchical clustering contains a subtree such that each leaf cluster in the tree is pure enough (i.e., it mostly contains either only good or only bad documents), with high probability, the method can correctly predict the quality of each document after querying a small number of documents. The number of such documents is proportional to the size of the smallest subtree with (almost) pure leaves, without the algorithm knowing this subtree in advance. Furthermore, in a comprehensive experimental study, we demonstrate the benefits of our algorithm compared to other classifier-based filtering methods.
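The abstract describes the method only at a high level. Below is a minimal sketch of how such adaptive, tree-based quality filtering could look, assuming a precomputed hierarchical clustering over document embeddings and a black-box LLM quality oracle; the class `ClusterNode`, the function `filter_by_tree`, and the parameters `samples_per_node` and `purity_threshold` are hypothetical names, and the paper's actual sampling schedule, stopping rule, and guarantees differ in the details.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ClusterNode:
    """A node in a (hypothetical) hierarchical clustering built over text embeddings."""
    documents: List[str]                                     # documents covered by this cluster
    children: List["ClusterNode"] = field(default_factory=list)
    label: Optional[bool] = None                             # True = keep (good), False = filter out (bad)


def filter_by_tree(root: ClusterNode,
                   llm_is_good: Callable[[str], bool],
                   samples_per_node: int = 8,
                   purity_threshold: float = 0.9,
                   seed: int = 0) -> int:
    """Adaptively spend LLM quality queries on a hierarchical clustering.

    Sample a few documents per cluster and ask the LLM oracle whether each is good.
    If the sample looks (almost) unanimously good or bad, label the whole cluster and
    stop; otherwise descend into the children. Returns the number of oracle queries.
    Illustrative sketch only; parameter names and the stopping rule are assumptions.
    """
    rng = random.Random(seed)
    queries = 0
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if not node.documents:
            continue
        sample = rng.sample(node.documents, min(samples_per_node, len(node.documents)))
        votes = [llm_is_good(doc) for doc in sample]
        queries += len(votes)
        good_frac = sum(votes) / len(votes)
        nearly_pure = good_frac >= purity_threshold or good_frac <= 1 - purity_threshold
        if nearly_pure or not node.children:
            node.label = good_frac >= 0.5    # every document in this cluster inherits the label
        else:
            frontier.extend(node.children)   # mixed cluster: refine by recursing into subclusters
    return queries
```

The query budget concentrates on mixed clusters: a near-pure cluster is labeled after only a handful of oracle calls and the label is propagated to all of its documents, which mirrors the abstract's claim that the number of queries scales with the size of the smallest subtree whose leaves are (almost) pure.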
Related papers
- Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe [42.35197658021889]
Dual encoder (DE) models, where a pair of matching query and document are embedded into similar vector representations, are widely used in information retrieval. This paper investigates the limitations of such models in the context of hierarchical retrieval (HR), where the document set has a hierarchical structure and the matching documents for a query are all of its ancestors. We introduce a pretrain-finetune recipe that significantly improves long-distance retrieval without sacrificing performance on closer documents.
arXiv Detail & Related papers (2025-09-19T20:35:58Z)
- ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval [64.44265315244579]
We propose a tree-based method for organizing and representing reference documents at various levels of granularity. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches. Our evaluations show that ReTreever generally preserves full representation accuracy.
arXiv Detail & Related papers (2025-02-11T21:35:13Z)
- Information-Theoretic Generative Clustering of Documents [24.56214029342293]
We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$. Because large language models (LLMs) provide probability distributions, the similarity between two documents can be rigorously defined. We show that GC achieves state-of-the-art performance, often outperforming previous clustering methods by a large margin.
arXiv Detail & Related papers (2024-12-18T06:21:21Z)
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
- Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data.
Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z)
- Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Comparative Study of Long Document Classification [0.0]
We revisit long document classification using standard machine learning approaches.
We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
arXiv Detail & Related papers (2021-11-01T04:51:51Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.