Related papers: Graph-based Active Learning for Entity Cluster Repair

URL: http://arxiv.org/abs/2401.14992v1
Date: Fri, 26 Jan 2024 16:42:49 GMT
Title: Graph-based Active Learning for Entity Cluster Repair
Authors: Victor Christen, Daniel Obraczka, Marvin Hofer, Martin Franke, Erhard Rahm
Abstract summary: Cluster repair methods aim to determine errors in clusters and modify them so that each cluster consists of records representing the same entity. Current cluster repair methodologies assume duplicate-free data sources, where each record from one source corresponds to a unique record from another. Recent approaches apply clustering methods in combination with link categorization methods so they can be applied to data sources with duplicates. We introduce a novel approach for cluster repair that utilizes graph metrics derived from the underlying similarity graphs.
Score: 1.7453520331111723
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cluster repair methods aim to determine errors in clusters and modify them so that each cluster consists of records representing the same entity. Current cluster repair methodologies primarily assume duplicate-free data sources, where each record from one source corresponds to a unique record from another. However, real-world data often deviates from this assumption due to quality issues. Recent approaches apply clustering methods in combination with link categorization methods so they can be applied to data sources with duplicates. Nevertheless, the results do not show a clear picture since the quality highly varies depending on the configuration and dataset. In this study, we introduce a novel approach for cluster repair that utilizes graph metrics derived from the underlying similarity graphs. These metrics are pivotal in constructing a classification model to distinguish between correct and incorrect edges. To address the challenge of limited training data, we integrate an active learning mechanism tailored to cluster-specific attributes. The evaluation shows that the method outperforms existing cluster repair methods without distinguishing between duplicate-free or dirty data sources. Notably, our modified active learning strategy exhibits enhanced performance when dealing with datasets containing duplicates, showcasing its effectiveness in such scenarios.
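As a rough illustration of what "graph metrics derived from the underlying similarity graphs" could look like as per-edge features, here is a minimal sketch in plain Python. The abstract does not list the metrics the authors actually use, so the function name `edge_features` and the four features chosen here (similarity, common neighbors, Jaccard overlap, minimum degree) are assumptions made for illustration only, not the paper's method.

```python
from collections import defaultdict


def edge_features(edges):
    """Compute simple structural features for each edge of a similarity graph.

    edges: dict mapping (u, v) record pairs to a similarity score.
    Returns a dict mapping each (u, v) pair to a feature dict.
    The feature set is hypothetical and chosen only for illustration.
    """
    # Build an undirected adjacency structure from the edge list.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    features = {}
    for (u, v), sim in edges.items():
        common = neighbors[u] & neighbors[v]
        # Neighborhood union, excluding the two endpoints themselves.
        union = (neighbors[u] | neighbors[v]) - {u, v}
        features[(u, v)] = {
            "similarity": sim,
            "common_neighbors": len(common),
            "jaccard": len(common) / len(union) if union else 0.0,
            "min_degree": min(len(neighbors[u]), len(neighbors[v])),
        }
    return features


# Toy cluster: a, b, c form a triangle; d is attached through one weak edge.
edges = {("a", "b"): 0.9, ("b", "c"): 0.85, ("a", "c"): 0.8, ("c", "d"): 0.55}
feats = edge_features(edges)
for pair, f in feats.items():
    print(pair, f)
```

In this toy cluster the edge (c, d) has no common neighbors and a low similarity, which is the kind of signal a classifier could use to flag an edge as likely incorrect. Feature vectors like these could then be fed to any standard classifier, with an active learning loop choosing which edges to send to a human for labeling.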

Related papers

Learning Unified Distance Metric for Heterogeneous Attribute Data Clustering [60.05209293008078]
Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm for cluster analysis. HARR is parameter-free, convergence-guaranteed, and can more effectively self-adapt to different sought numbers of clusters $k$.
arXiv Detail & Related papers (2026-03-03T08:13:16Z)
Order is All You Need for Categorical Data Clustering [31.851890008893847]
This paper introduces a new finding that the order relation among attribute values is the decisive factor in clustering accuracy. We propose a new learning paradigm that allows joint learning of clusters and the orders. The algorithm achieves superior clustering accuracy with a convergence guarantee.
arXiv Detail & Related papers (2024-11-19T08:23:25Z)
Discriminative Anchor Learning for Efficient Multi-view Clustering [59.11406089896875]
We propose discriminative anchor learning for multi-view clustering (DALMC). We learn discriminative view-specific feature representations according to the original dataset. We build anchors from different views based on these representations, which increase the quality of the shared anchor graph.
arXiv Detail & Related papers (2024-09-25T13:11:17Z)
Self Supervised Correlation-based Permutations for Multi-View Clustering [7.972599673048582]
We propose an end-to-end deep learning-based MVC framework for general data. Our approach involves learning meaningful fused data representations with a novel permutation-based canonical correlation objective. We demonstrate the effectiveness of our model using ten MVC benchmark datasets.
arXiv Detail & Related papers (2024-02-26T08:08:30Z)
Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task. We propose a co-training-based framework that encourages clustering consistency. Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
AugDMC: Data Augmentation Guided Deep Multiple Clustering [2.479720095773358]
AugDMC is a novel data Augmentation guided Deep Multiple Clustering method. It exploits data augmentations to automatically extract features related to a certain aspect of the data. A stable optimization strategy is proposed to alleviate the unstable problem from different augmentations.
arXiv Detail & Related papers (2023-06-22T16:31:46Z)
ClusterNet: A Perception-Based Clustering Model for Scattered Data [16.326062082938215]
Cluster separation in scatterplots is a task that is typically tackled by widely used clustering techniques. We propose a learning strategy which directly operates on scattered data. We train ClusterNet, a point-based deep learning model, to reflect human perception of cluster separability.
arXiv Detail & Related papers (2023-04-27T13:41:12Z)
Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
Inv-SENnet: Invariant Self Expression Network for clustering under biased data [17.25929452126843]
We propose a novel framework for jointly removing unwanted attributes (biases) while learning to cluster data points in individual subspaces. Our experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-11-13T01:19:06Z)
Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types. This is different from anomaly detection, whose goal is to divide anomalies from normal data. We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data. Our method uses the self-expressiveness of samples to capture the global structure and an adaptive neighbor approach to respect the local structure. Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
reval: a Python package to determine best clustering solutions with stability-based relative clustering validation [1.8129328638036126]
reval is a Python package that leverages stability-based relative clustering validation methods to determine the best clustering solutions. This work aims at developing a stability-based method that selects the best clustering solution as the one that replicates, via supervised learning, on unseen subsets of data.
arXiv Detail & Related papers (2020-08-27T10:36:56Z)
Unsupervised Person Re-identification via Softened Similarity Learning [122.70472387837542]
Person re-identification (re-ID) is an important topic in computer vision. This paper studies the unsupervised setting of re-ID, which does not require any labeled information. Experiments on two image-based and video-based datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2020-04-07T17:16:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.