Explainable embeddings with Distance Explainer
- URL: http://arxiv.org/abs/2505.15516v1
- Date: Wed, 21 May 2025 13:42:28 GMT
- Title: Explainable embeddings with Distance Explainer
- Authors: Christiaan Meijer, E. G. Patrick Bos
- Abstract summary: We introduce Distance Explainer, a novel method for generating local, post-hoc explanations of embedded spaces in machine learning models. Our approach adapts saliency-based techniques from RISE to explain the distance between two embedded data points by assigning attribution values through selective masking and distance-ranked mask filtering.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While eXplainable AI (XAI) has advanced significantly, few methods address interpretability in embedded vector spaces where dimensions represent complex abstractions. We introduce Distance Explainer, a novel method for generating local, post-hoc explanations of embedded spaces in machine learning models. Our approach adapts saliency-based techniques from RISE to explain the distance between two embedded data points by assigning attribution values through selective masking and distance-ranked mask filtering. We evaluate Distance Explainer on cross-modal embeddings (image-image and image-caption pairs) using established XAI metrics including Faithfulness, Sensitivity/Robustness, and Randomization. Experiments with ImageNet and CLIP models demonstrate that our method effectively identifies features contributing to similarity or dissimilarity between embedded data points while maintaining high robustness and consistency. We also explore how parameter tuning, particularly mask quantity and selection strategy, affects explanation quality. This work addresses a critical gap in XAI research and enhances transparency and trustworthiness in deep learning applications utilizing embedded spaces.
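To make the masking-and-ranking idea concrete, below is a minimal sketch of how a RISE-style distance attribution could be set up. It is an illustration under stated assumptions, not the authors' implementation: `embed` stands in for any encoder mapping an input to a vector (e.g. a CLIP image encoder), and the mask grid size, mask count, and keep fraction are placeholder choices.

```python
# Illustrative sketch only, not the reference Distance Explainer implementation.
# Assumptions: `image` is an (H, W, C) float array, `embed` maps an image to a
# 1-D embedding vector (e.g. a CLIP image encoder), and `reference_embedding`
# is the embedding of the second data point in the pair.
import numpy as np

def distance_attribution(image, reference_embedding, embed,
                         n_masks=1000, grid=8, keep_fraction=0.25, seed=0):
    """Attribute the embedding-space distance between `image` and a
    reference point to spatial regions of `image` (RISE-style)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    cell_h, cell_w = -(-h // grid), -(-w // grid)  # ceil division

    # 1. Sample coarse random binary masks and upsample them, as in RISE.
    coarse = rng.random((n_masks, grid, grid)) < 0.5
    masks = np.stack([np.kron(m, np.ones((cell_h, cell_w)))[:h, :w]
                      for m in coarse]).astype(float)

    # 2. Embed each masked image and measure its distance to the reference.
    distances = np.empty(n_masks)
    for i, m in enumerate(masks):
        masked = image * m[..., None]  # hide the regions the mask removes
        distances[i] = np.linalg.norm(embed(masked) - reference_embedding)

    # 3. Distance-ranked mask filtering: keep the masks whose visible content
    #    brings the embedding closest to the reference point.
    keep = np.argsort(distances)[: int(keep_fraction * n_masks)]

    # 4. Aggregate: regions that stay visible in the kept masks receive high
    #    attribution for similarity; normalize by per-pixel visibility counts.
    saliency = masks[keep].sum(axis=0) / (masks.sum(axis=0) + 1e-8)
    return saliency
```

Flipping the ranking (keeping the masks that push the embeddings furthest apart) would analogously highlight regions driving dissimilarity; the paper itself details the actual masking, filtering, and normalization choices.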
Related papers
- Cross-Modal Geometric Hierarchy Fusion: An Implicit-Submap Driven Framework for Resilient 3D Place Recognition [4.196626042312499]
We propose a novel framework that redefines 3D place recognition through density-agnostic geometric reasoning. Specifically, we introduce an implicit 3D representation based on elastic points, which is immune to the interference of original scene point cloud density. With the aid of these two types of information, we obtain descriptors that fuse geometric information from both bird's-eye view and 3D segment perspectives.
arXiv Detail & Related papers (2025-06-17T07:04:07Z)
- EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds [10.324549723042338]
Cross-modal data registration has long been a critical task in computer vision. We propose a method that uses edge information from the original point clouds and images for cross-modal registration. We validate our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.
arXiv Detail & Related papers (2025-03-19T15:03:41Z)
- GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric encapsulates two formats: diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z)
- Frequency-Spatial Entanglement Learning for Camouflaged Object Detection [34.426297468968485]
Existing methods attempt to reduce the impact of pixel similarity by maximizing the distinguishing ability of spatial features with complicated design.
We propose a new approach to address this issue by jointly exploring the representation in the frequency and spatial domains, introducing the Frequency-Spatial Entanglement Learning (FSEL) method.
Our experiments demonstrate the superiority of our FSEL over 21 state-of-the-art methods, through comprehensive quantitative and qualitative comparisons on three widely-used datasets.
arXiv Detail & Related papers (2024-09-03T07:58:47Z)
- MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation [155.0797148367653]
Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain and an unlabeled target domain.
We propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries.
We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks.
arXiv Detail & Related papers (2024-08-29T12:15:10Z)
- MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics [41.94295877935867]
We study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition.
Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors.
arXiv Detail & Related papers (2024-07-22T14:24:56Z)
- Improved LiDAR Odometry and Mapping using Deep Semantic Segmentation and Novel Outliers Detection [1.0334138809056097]
We propose a novel framework for real-time LiDAR odometry and mapping based on LOAM architecture for fast moving platforms.
Our framework utilizes semantic information produced by a deep learning model to improve point-to-line and point-to-plane matching.
We study the effect of improving the matching process on the robustness of LiDAR odometry against high speed motion.
arXiv Detail & Related papers (2024-03-05T16:53:24Z)
- Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching [58.10418136917358]
Cross-modality registration between 2D images from cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotic training.
Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks.
We learn a structured cross-modality matching solver to represent 3D features via a different latent pixel space.
arXiv Detail & Related papers (2023-12-07T05:46:10Z)
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- Generating detailed saliency maps using model-agnostic methods [0.0]
We focus on a model-agnostic explainability method called RISE and elaborate on observed shortcomings of its grid-based approach.
The proposed modifications, collectively called VRISE (Voronoi-RISE), are meant to improve the accuracy of maps generated using large occlusions.
We compare accuracy of saliency maps produced by VRISE and RISE on the validation split of ILSVRC2012, using a saliency-guided content insertion/deletion metric and a localization metric based on bounding boxes.
arXiv Detail & Related papers (2022-09-04T21:34:46Z)
- Robust Person Re-Identification through Contextual Mutual Boosting [77.1976737965566]
We propose the Contextual Mutual Boosting Network (CMBN), which localizes pedestrians and recalibrates features by effectively exploiting contextual information and statistical inference.
Experiments on the benchmarks demonstrate the superiority of the architecture compared to the state-of-the-art.
arXiv Detail & Related papers (2020-09-16T06:33:35Z)
- Learning Invariant Representations for Reinforcement Learning without Reconstruction [98.33235415273562]
We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction.
Bisimulation metrics quantify behavioral similarity between states in continuous MDPs.
We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks.
arXiv Detail & Related papers (2020-06-18T17:59:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.