Related papers: A Hierarchical Approach to Scaling Batch Active Search Over Structured Data

URL: http://arxiv.org/abs/2007.10263v1
Date: Mon, 20 Jul 2020 16:50:25 GMT
Title: A Hierarchical Approach to Scaling Batch Active Search Over Structured Data
Authors: Vivek Myers and Peyton Greenside
Abstract summary: We present Hierarchical Batch Bandit Search (HBBS), a general hierarchical framework based on bandit algorithms to scale active search to large batch sizes. We focus our application of HBBS on modern biology, where large batch experimentation is often fundamental to the research process.
Score: 0.5076419064097732
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Active search is the process of identifying high-value data points in a large and often high-dimensional parameter space that can be expensive to evaluate. Traditional active search techniques like Bayesian optimization trade off exploration and exploitation over consecutive evaluations, and have historically focused on single or small (<5) numbers of examples evaluated per round. As modern data sets grow, so does the need to scale active search to large data sets and batch sizes. In this paper, we present a general hierarchical framework based on bandit algorithms to scale active search to large batch sizes by maximizing information derived from the unique structure of each dataset. Our hierarchical framework, Hierarchical Batch Bandit Search (HBBS), strategically distributes batch selection across a learned embedding space by facilitating wide exploration of different structural elements within a dataset. We focus our application of HBBS on modern biology, where large batch experimentation is often fundamental to the research process, and demonstrate batch design of biological sequences (protein and DNA). We also present a new Gym environment to easily simulate diverse biological sequences and to enable more comprehensive evaluation of active search methods across heterogeneous data sets. The HBBS framework improves upon standard performance, wall-clock, and scalability benchmarks for batch search by using a broad exploration strategy across coarse partitions and fine-grained exploitation within each partition of structured data.
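
The coarse-exploration and fine-exploitation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' HBBS implementation: it assumes the data set has already been partitioned (for example by clustering a learned embedding), treats each partition as a bandit arm with a Beta posterior, distributes batch slots across arms with Thompson sampling, and then takes the top-scoring items inside each arm. The function names, the Beta/Thompson choice, and the `scores` input (a stand-in for any predictive model) are all hypothetical.

```python
import random

def allocate_batch(posteriors, capacity, batch_size, rng):
    """Distribute batch slots across partitions (arms) by Thompson sampling.

    posteriors: arm -> (alpha, beta) parameters of a Beta posterior (assumed form).
    capacity:   arm -> number of unlabeled items still available in that arm.
    """
    quota = {arm: 0 for arm in posteriors}
    for _ in range(batch_size):
        # Only arms that still have unselected items can win a slot.
        open_arms = [arm for arm in posteriors if quota[arm] < capacity[arm]]
        if not open_arms:
            break
        draws = {arm: rng.betavariate(*posteriors[arm]) for arm in open_arms}
        quota[max(draws, key=draws.get)] += 1
    return quota

def select_batch(partitions, scores, posteriors, batch_size, rng):
    """Coarse exploration across arms, then greedy exploitation within each arm.

    partitions: arm -> list of unlabeled item ids.
    scores:     item id -> predicted value from some model (hypothetical input).
    """
    capacity = {arm: len(items) for arm, items in partitions.items()}
    quota = allocate_batch(posteriors, capacity, batch_size, rng)
    batch = []
    for arm, k in quota.items():
        ranked = sorted(partitions[arm], key=scores.get, reverse=True)
        batch.extend(ranked[:k])
    return batch

def update(posteriors, arm, hit):
    """Update an arm's Beta posterior after observing one evaluated item."""
    alpha, beta = posteriors[arm]
    posteriors[arm] = (alpha + 1, beta) if hit else (alpha, beta + 1)
```

After each experimental round, `update` would be called once per evaluated item so that partitions yielding more high-value items receive a larger share of the next batch.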

Related papers

Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis [0.0]
We evaluate fixed-size chunking strategies and their influence on retrieval performance using multiple embedding models. Our experiments, conducted on both short-form and long-form datasets, reveal that chunk size plays a critical role in retrieval effectiveness.
arXiv Detail & Related papers (2025-05-27T19:39:16Z)
Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
ROGRAG: A Robustly Optimized GraphRAG Framework [45.947928801693266]
Graph-based retrieval-augmented generation (GraphRAG) addresses this by structuring domain knowledge as a graph for dynamic retrieval. Existing pipelines involve complex engineering, making it difficult to isolate the impact of individual components. We introduce ROGRAG, a Robustly Optimized GraphRAG framework, which integrates dual-level with logic form retrieval methods to improve robustness without increasing computational cost.
arXiv Detail & Related papers (2025-03-09T06:20:24Z)
HiBO: Hierarchical Bayesian Optimization via Adaptive Search Space Partitioning [0.7737746260673106]
HiBO is a novel hierarchical algorithm integrating global-level search space partitioning information into the acquisition strategy of a local BO-based optimizer. A set of evaluations demonstrates that HiBO outperforms state-of-the-art methods in high-dimensional synthetic benchmarks.
arXiv Detail & Related papers (2024-10-30T16:04:16Z)
FOR-instance: a UAV laser scanning benchmark dataset for semantic and instance segmentation of individual trees [0.06597195879147556]
FOR-instance dataset comprises five curated and ML-ready UAV-based laser scanning data collections. The dataset is divided into development and test subsets, enabling method advancement and evaluation. The inclusion of diameter at breast height data expands its utility to the measurement of a classic tree variable.
arXiv Detail & Related papers (2023-09-03T22:08:29Z)
Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data. The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships. A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering. We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z)
Scalable Batch Acquisition for Deep Bayesian Active Learning [70.68403899432198]
In deep active learning, it is important to choose multiple examples to mark up at each step. Existing solutions to this problem, such as BatchBALD, have significant limitations in selecting a large number of examples. We present the Large BatchBALD algorithm, which aims to achieve comparable quality while being more computationally efficient.
arXiv Detail & Related papers (2023-01-13T11:45:17Z)
Frequent Itemset-driven Search for Finding Minimum Node Separators in Complex Networks [61.2383572324176]
We propose a frequent itemset-driven search approach, which integrates the concept of frequent itemset mining in data mining into the well-known memetic search framework. It iteratively employs the frequent itemset recombination operator to generate promising offspring solutions based on itemsets that frequently occur in high-quality solutions. In particular, it discovers 29 new upper bounds and matches 18 previous best-known bounds.
arXiv Detail & Related papers (2022-01-18T11:16:40Z)
Towards General and Efficient Active Learning [20.888364610175987]
Active learning aims to select the most informative samples to exploit limited annotation budgets. We propose a novel general and efficient active learning (GEAL) method in this paper. Our method can conduct data selection processes on different datasets with a single-pass inference of the same model.
arXiv Detail & Related papers (2021-12-15T08:35:28Z)
Multidimensional Assignment Problem for multipartite entity resolution [69.48568967931608]
Multipartite entity resolution aims at integrating records from multiple datasets into one entity. We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem. We find evidence that design-based multi-start can be more efficient as the size of databases grows large.
arXiv Detail & Related papers (2021-12-06T20:34:55Z)
Structural Textile Pattern Recognition and Processing Based on Hypergraphs [2.4963790083110426]
We introduce an approach for recognising similar weaving patterns based on their structures for textile archives. We first represent textile structures using hypergraphs and extract multisets of k-neighbourhoods describing weaving patterns from these graphs. The resulting multisets are clustered using various distance measures and various clustering algorithms.
arXiv Detail & Related papers (2021-03-21T00:44:40Z)
Learning from Data to Speed-up Sorted Table Search Procedures: Methodology and Practical Guidelines [0.0]
We study to what extent Machine Learning Techniques can contribute to obtaining such a speed-up. We characterize the scenarios in which those latter can be profitably used with respect to the former, accounting for both CPU and GPU computing. Indeed, we formalize an Algorithmic Paradigm of Learned Dichotomic Sorted Table Search procedures that naturally complements the Learned one proposed here and that characterizes most of the known Sorted Table Search Procedures as having a "learning phase" that approximates Simple Linear Regression.
arXiv Detail & Related papers (2020-07-20T16:26:54Z)
AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes. We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance. Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.