Related papers: CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

URL: http://arxiv.org/abs/2507.21888v2
Date: Fri, 10 Oct 2025 16:18:55 GMT
Title: CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Authors: Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
Abstract summary: Embodied Reference Understanding involves predicting the object that a person in the scene is referring to through both pointing gesture and language. We propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
Score: 56.30142869506262
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold. We further evaluate our approach on the CAESAR and ISL Pointing datasets.
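Illustrative sketch (not the authors' implementation): the abstract describes a Gaussian ray heatmap built from the head-to-fingertip and wrist-to-fingertip lines but gives no formula. The Python snippet below shows one plausible way to rasterize such a heatmap. The function name `gaussian_ray_heatmap`, the `sigma` parameter, the choice to start the ray at the origin keypoint, and the example keypoint coordinates are all assumptions made for illustration.

```python
import numpy as np


def gaussian_ray_heatmap(origin, fingertip, height, width, sigma=10.0):
    """Return an (height, width) map that peaks along the ray from `origin`
    through `fingertip` and decays as a Gaussian of the distance to that ray.

    Assumption: points are (x, y) pixel coordinates, and the ray starts at
    `origin` (head or wrist keypoint) and extends indefinitely past the fingertip.
    """
    origin = np.asarray(origin, dtype=np.float64)
    fingertip = np.asarray(fingertip, dtype=np.float64)
    direction = fingertip - origin
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("origin and fingertip must be distinct points")
    direction = direction / norm

    # Pixel grid as (x, y) offsets from the ray origin, shape (H, W, 2).
    ys, xs = np.mgrid[0:height, 0:width]
    rel = np.stack([xs, ys], axis=-1).astype(np.float64) - origin

    # Signed position of each pixel along the ray; pixels behind the origin
    # are clamped so that their nearest ray point is the origin itself.
    t = np.maximum(rel @ direction, 0.0)
    closest = t[..., None] * direction
    dist = np.linalg.norm(rel - closest, axis=-1)

    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))


# Hypothetical keypoints (x, y) for a 32 x 40 image.
head, wrist, tip = (10, 10), (16, 14), (20, 10)
head_map = gaussian_ray_heatmap(head, tip, height=32, width=40)
wrist_map = gaussian_ray_heatmap(wrist, tip, height=32, width=40)
```

In a dual-model setup like the one described, each map could be concatenated with the image as an extra input channel for its respective model; how the paper actually injects the heatmaps is not specified in this abstract.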

Related papers

Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing [76.2602505940467]
Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors.
arXiv Detail & Related papers (2026-02-18T13:40:53Z)
Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA).
arXiv Detail & Related papers (2026-02-01T14:35:46Z)
Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence [37.26437707181298]
We propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations.
arXiv Detail & Related papers (2025-06-09T20:40:47Z)
Just Functioning as a Hook for Two-Stage Referring Multi-Object Tracking [22.669740476582835]
Referring Multi-Object Tracking (RMOT) aims to localize target trajectories in videos specified by natural language expressions. We present a systematic analysis of the intrinsic relationship between the two subtasks of tracking and referring in RMOT. We propose JustHook, a novel two-stage RBT framework where a Hook module is first designed to redefine the linkage between subtasks.
arXiv Detail & Related papers (2025-03-10T16:38:42Z)
PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation [32.04698431036215]
In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework. We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations.
arXiv Detail & Related papers (2024-11-09T02:38:29Z)
Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system. Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context. Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates comparable performance with current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z)
Exploiting Point-Wise Attention in 6D Object Pose Estimation Based on Bidirectional Prediction [22.894810893732416]
The paper proposes a bidirectional correspondence prediction network with a point-wise attention-aware mechanism. Our key insight is that the correlations between each model point and scene point provide essential information for learning point-pair matches. Experimental results on the public datasets of LineMOD, YCB-Video, and Occ-LineMOD show that the proposed method achieves better performance than other state-of-the-art methods.
arXiv Detail & Related papers (2023-08-16T17:13:45Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections. We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing. We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
Line Graph Contrastive Learning for Link Prediction [4.876567687745239]
We propose a Line Graph Contrastive Learning (LGCL) method to obtain multiview information. With experiments on six public datasets, LGCL outperforms current benchmarks on link prediction tasks.
arXiv Detail & Related papers (2022-10-25T06:57:00Z)
Weakly Supervised Video Salient Object Detection via Point Supervision [18.952253968878356]
We propose a strong baseline model based on point supervision. To infer saliency maps with temporal information, we mine inter-frame complementary information from short-term and long-term perspectives. We label two point-supervised datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and the DAVSOD dataset.
arXiv Detail & Related papers (2022-07-15T03:31:15Z)
Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework [59.578339075658995]
We propose a purely point-based framework for joint crowd counting and individual localization. We design an intuitive solution under this framework, which is called Point to Point Network (P2PNet).
arXiv Detail & Related papers (2021-07-27T11:41:50Z)
SOLD2: Self-supervised Occlusion-aware Line Description and Detection [95.8719432775724]
We introduce the first joint detection and description of line segments in a single deep network. Our method does not require any annotated line labels and can therefore generalize to any dataset. We evaluate our approach against previous line detection and description methods on several multi-view datasets.
arXiv Detail & Related papers (2021-04-07T19:27:17Z)
Articulation-aware Canonical Surface Mapping [54.0990446915042]
We tackle the tasks of predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and inferring the articulation and pose of the template corresponding to the input image. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.
arXiv Detail & Related papers (2020-04-01T17:56:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.