Set2Box: Similarity Preserving Representation Learning of Sets
- URL: http://arxiv.org/abs/2210.03282v1
- Date: Fri, 7 Oct 2022 02:11:12 GMT
- Title: Set2Box: Similarity Preserving Representation Learning of Sets
- Authors: Geon Lee, Chanyoung Park, Kijung Shin
- Abstract summary: We propose Set2Box, a learning-based approach for compressed representations of sets.
We also design Set2Box+, which yields more concise but more accurate box representations of sets.
Through experiments on 8 real-world datasets, we show that Set2Box+ is (a) Accurate: achieving up to 40.8X smaller estimation error while requiring 60% fewer bits to encode sets, (b) Concise: yielding up to 96.8X more concise representations with similar estimation error, and (c) Versatile: enabling the estimation of four set-similarity measures from a single representation of each set.
- Score: 18.85308805841525
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sets have been used for modeling various types of objects (e.g., a document
as the set of keywords in it and a customer as the set of the items that she
has purchased). Measuring similarity (e.g., Jaccard Index) between sets has
been a key building block of a wide range of applications, including
plagiarism detection, recommendation, and graph compression. However, as sets
have grown in numbers and sizes, the computational cost and storage required
for set similarity computation have become substantial, and this has led to the
development of hashing and sketching based solutions. In this work, we propose
Set2Box, a learning-based approach for compressed representations of sets from
which various similarity measures can be estimated accurately in constant time.
The key idea is to represent sets as boxes to precisely capture overlaps of
sets. Additionally, based on the proposed box quantization scheme, we design
Set2Box+, which yields more concise but more accurate box representations of
sets. Through extensive experiments on 8 real-world datasets, we show that,
compared to baseline approaches, Set2Box+ is (a) Accurate: achieving up to
40.8X smaller estimation error while requiring 60% fewer bits to encode sets,
(b) Concise: yielding up to 96.8X more concise representations with similar
estimation error, and (c) Versatile: enabling the estimation of four
set-similarity measures from a single representation of each set.
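To make the key idea in the abstract concrete, the sketch below shows how overlap-based similarity measures could be read off two axis-aligned boxes in time that depends only on the box dimension, not on the set sizes. It is a minimal illustration, not the authors' implementation: the `Box` class, the `similarities` function, and the hand-picked coordinates are hypothetical, the boxes are not learned, and the choice of four measures (Jaccard, Dice, overlap coefficient, cosine) is an assumption, since the abstract does not name them.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box: per-dimension lower and upper corners."""
    lower: tuple
    upper: tuple

    def volume(self) -> float:
        # Product of side lengths; an empty box (upper < lower) has volume 0.
        vol = 1.0
        for lo, hi in zip(self.lower, self.upper):
            vol *= max(hi - lo, 0.0)
        return vol


def intersect(a: Box, b: Box) -> Box:
    """Intersection of two boxes (possibly empty)."""
    lower = tuple(max(x, y) for x, y in zip(a.lower, b.lower))
    upper = tuple(min(x, y) for x, y in zip(a.upper, b.upper))
    return Box(lower, upper)


def similarities(a: Box, b: Box) -> dict:
    """Estimate set similarities, treating box volume as a stand-in for set size."""
    va, vb = a.volume(), b.volume()
    vi = intersect(a, b).volume()
    union = va + vb - vi
    return {
        "jaccard": vi / union if union > 0 else 0.0,
        "dice": 2 * vi / (va + vb) if va + vb > 0 else 0.0,
        "overlap": vi / min(va, vb) if min(va, vb) > 0 else 0.0,
        "cosine": vi / sqrt(va * vb) if va * vb > 0 else 0.0,
    }


# Two 2-D boxes of volume 4 each whose intersection has volume 1.
box_a = Box((0.0, 0.0), (2.0, 2.0))
box_b = Box((1.0, 1.0), (3.0, 3.0))
sims = similarities(box_a, box_b)
print(sims)
```

All four values come from the same pair of boxes, which mirrors the abstract's point that a single representation per set can serve several similarity measures.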
Related papers
- Scaling LLM Inference with Optimized Sample Compute Allocation [56.524278187351925]
We propose OSCA, an algorithm to find an optimal mix of different inference configurations.
Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration.
OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute than the default configuration.
arXiv Detail & Related papers (2024-10-29T19:17:55Z) - Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric [44.95433989446052]
We show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP.
We show that our proposed similarity based on weighted point sets consistently achieves the optimal similarity.
arXiv Detail & Related papers (2024-04-30T03:15:04Z) - FaceCoresetNet: Differentiable Coresets for Face Set Recognition [16.879093388124964]
A discriminative descriptor balances two policies when aggregating information from a given set.
This work frames face-set representation as a differentiable coreset selection problem.
We set a new SOTA for set face verification on the IJB-B and IJB-C datasets.
arXiv Detail & Related papers (2023-08-27T11:38:42Z) - Improving Cross-Modal Retrieval with Set of Diverse Embeddings [19.365974066256026]
Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity.
Set-based embedding has been studied as a solution to this problem.
We present a novel set-based embedding method, which is distinct from previous work in two aspects.
arXiv Detail & Related papers (2022-11-30T05:59:23Z) - GBRS: An Unified Model of Pawlak Rough Set and Neighborhood Rough Set [67.17936132922955]
Pawlak rough set and neighborhood rough set are the two most common rough set theoretical models.
This paper presents a granular-ball rough set based on granular-ball computing.
arXiv Detail & Related papers (2022-01-10T14:05:02Z) - SLOSH: Set LOcality Sensitive Hashing via Sliced-Wasserstein Embeddings [18.916058638077274]
This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions.
We propose Sliced-Wasserstein set embedding as a computationally efficient "set-2-vector" mechanism that enables downstream ANN.
We demonstrate the effectiveness of our algorithm, denoted as Set-LOcality Sensitive Hashing (SLOSH), on various set retrieval datasets.
arXiv Detail & Related papers (2021-12-11T00:10:05Z) - Mini-Batch Consistent Slot Set Encoder for Scalable Set Encoding [50.61114177411961]
We introduce a new property termed Mini-Batch Consistency that is required for large scale mini-batch set encoding.
We present a scalable and efficient set encoding mechanism that is amenable to mini-batch processing with respect to set elements and capable of updating set representations as more data arrives.
arXiv Detail & Related papers (2021-03-02T10:10:41Z) - Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback [51.21673420940346]
Combinatorial bandits generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set.
We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set.
Based on a projection-free online learning algorithm for finite polytopes, it is the first computationally efficient algorithm which is asymptotically optimal and has competitive empirical performance.
arXiv Detail & Related papers (2021-01-21T10:35:09Z) - Set Distribution Networks: a Generative Model for Sets of Images [22.405670277339023]
We introduce Set Distribution Networks (SDNs), a framework that learns to autoencode and freely generate sets.
We show that SDNs are able to reconstruct image sets that preserve salient attributes of the inputs in our benchmark datasets.
We examine the sets generated by SDN with a pre-trained 3D reconstruction network and a face verification network, respectively, as a novel way to evaluate the quality of generated sets of images.
arXiv Detail & Related papers (2020-06-18T17:38:56Z) - Rethinking Object Detection in Retail Stores [55.359582952686175]
We propose a new task, simultaneous object localization and counting, abbreviated as Locount.
Locount requires algorithms to localize groups of objects of interest with the number of instances.
We collect a large-scale object localization and counting dataset with rich annotations in retail stores.
arXiv Detail & Related papers (2020-03-18T14:01:54Z) - Learn to Predict Sets Using Feed-Forward Neural Networks [63.91494644881925]
This paper addresses the task of set prediction using deep feed-forward neural networks.
We present a novel approach for learning to predict sets with unknown permutation and cardinality.
We demonstrate the validity of our set formulations on relevant vision problems.
arXiv Detail & Related papers (2020-01-30T01:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.