Similar Data Points Identification with LLM: A Human-in-the-loop Strategy Using Summarization and Hidden State Insights
- URL: http://arxiv.org/abs/2404.04281v1
- Date: Wed, 3 Apr 2024 03:17:28 GMT
- Title: Similar Data Points Identification with LLM: A Human-in-the-loop Strategy Using Summarization and Hidden State Insights
- Authors: Xianlong Zeng, Fanghao Song, Ang Liu,
- Abstract summary: This study introduces a simple yet effective method for identifying similar data points across non-free text domains.
Our two-step approach involves data point summarization and hidden state extraction.
We demonstrate the effectiveness of our method in identifying similar data points on multiple datasets.
- Score: 0.29260385019352086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study introduces a simple yet effective method for identifying similar data points across non-free text domains, such as tabular and image data, using Large Language Models (LLMs). Our two-step approach involves data point summarization and hidden state extraction. Initially, data is condensed via summarization using an LLM, reducing complexity and highlighting essential information in sentences. Subsequently, the summarization sentences are fed through another LLM to extract hidden states, serving as compact, feature-rich representations. This approach leverages the advanced comprehension and generative capabilities of LLMs, offering a scalable and efficient strategy for similarity identification across diverse datasets. We demonstrate the effectiveness of our method in identifying similar data points on multiple datasets. Additionally, our approach enables non-technical domain experts, such as fraud investigators or marketing operators, to quickly identify similar data points tailored to specific scenarios, demonstrating its utility in practical applications. In general, our results open new avenues for leveraging LLMs in data analysis across various domains.
Related papers
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - CLIP-Loc: Multi-modal Landmark Association for Global Localization in
Object-based Maps [0.16492989697868893]
This paper describes a multi-modal data association method for global localization using object-based maps and camera images.
We propose labeling landmarks with natural language descriptions and extracting correspondences based on conceptual similarity with image observations.
arXiv Detail & Related papers (2024-02-08T22:59:12Z) - infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - Demonstration of InsightPilot: An LLM-Empowered Automated Data
Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z) - Spatiotemporal Self-supervised Learning for Point Clouds in the Wild [65.56679416475943]
We introduce an SSL strategy that leverages positive pairs in both the spatial and temporal domain.
We demonstrate the benefits of our approach via extensive experiments performed by self-supervised training on two large-scale LiDAR datasets.
arXiv Detail & Related papers (2023-03-28T18:06:22Z) - Continual Barlow Twins: continual self-supervised learning for remote
sensing semantic segmentation [8.775728170359024]
We propose a new algorithm for merging Self-Supervised Learning (SSL) and Continual Learning (CL) for remote sensing applications, that we call Continual Barlow Twins (CBT)
CBT combines the advantages of one of the simplest self-supervision techniques, i.e., Barlow Twins, with the Elastic Weight Consolidation method to avoid catastrophic forgetting.
For the first time we evaluate SSL methods on a highly heterogeneous Earth Observation dataset, showing the effectiveness of these strategies.
arXiv Detail & Related papers (2022-05-23T14:02:12Z) - Towards Domain-Agnostic Contrastive Learning [103.40783553846751]
We propose a novel domain-agnostic approach to contrastive learning, named DACL.
Key to our approach is the use of Mixup noise to create similar and dissimilar examples by mixing data samples differently either at the input or hidden-state levels.
Our results show that DACL not only outperforms other domain-agnostic noising methods, such as Gaussian-noise, but also combines well with domain-specific methods, such as SimCLR.
arXiv Detail & Related papers (2020-11-09T13:41:56Z) - Online Descriptor Enhancement via Self-Labelling Triplets for Visual
Data Association [28.03285334702022]
We propose a self-supervised method for incrementally refining visual descriptors to improve performance in the task of object-level visual data association.
Our method optimize deep descriptor generators online, by continuously training a widely available image classification network pre-trained with domain-independent data.
We show that our approach surpasses other visual data-association methods applied to a tracking-by-detection task, and show that it provides better performance-gains when compared to other methods that attempt to adapt to observed information.
arXiv Detail & Related papers (2020-11-06T17:42:04Z) - Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation efforts by learning to count in the crowd from limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
arXiv Detail & Related papers (2020-07-07T04:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.