CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
- URL: http://arxiv.org/abs/2507.21888v1
- Date: Tue, 29 Jul 2025 15:00:21 GMT
- Title: CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
- Authors: Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
- Abstract summary: Embodied Reference Understanding involves predicting the object that a person in the scene is referring to through both pointing gesture and language. We propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features.
- Score: 55.33317649771575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
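A minimal sketch of how such a Gaussian ray heatmap could be rasterized (the function name, the sigma value, and the one-sided falloff are our assumptions; the paper's exact parameterization may differ):

```python
import numpy as np

def gaussian_ray_heatmap(h, w, origin, fingertip, sigma=8.0):
    """One-sided ray heatmap: value decays as a Gaussian of each pixel's
    perpendicular distance to the ray origin -> fingertip."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    o = np.asarray(origin, dtype=np.float32)         # (x, y): head or wrist
    d = np.asarray(fingertip, dtype=np.float32) - o
    d /= np.linalg.norm(d) + 1e-8                    # unit pointing direction
    px, py = xs - o[0], ys - o[1]                    # pixel offsets from origin
    t = px * d[0] + py * d[1]                        # signed projection on the ray
    perp = np.hypot(px - t * d[0], py - t * d[1])    # distance to the line
    dist = np.where(t >= 0, perp, np.hypot(px, py))  # cut off behind the origin
    return np.exp(-0.5 * (dist / sigma) ** 2)

# Two complementary cues, one per model in the dual-model framework:
head_map  = gaussian_ray_heatmap(480, 640, origin=(320, 100), fingertip=(400, 240))
wrist_map = gaussian_ray_heatmap(480, 640, origin=(350, 200), fingertip=(400, 240))
```

Each model would take its heatmap as an additional input channel, and the CLIP-Aware Pointing Ensemble would then arbitrate between the two resulting predictions using CLIP features.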
Related papers
- Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence [37.26437707181298]
We propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations.
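A minimal back-projection sketch of the lifting step, assuming known pinhole intrinsics for illustration (the paper itself avoids camera annotations, so this is a simplification):

```python
import numpy as np

def lift_keypoints(kps_2d, depth_map, fx, fy, cx, cy):
    """Back-project 2D keypoints into 3D camera space with predicted depth."""
    pts = []
    for u, v in kps_2d:
        z = depth_map[int(v), int(u)]   # monocular depth estimate at the keypoint
        x = (u - cx) * z / fx           # pinhole back-projection
        y = (v - cy) * z / fy
        pts.append((x, y, z))
    return np.asarray(pts, dtype=np.float32)
```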
arXiv Detail & Related papers (2025-06-09T20:40:47Z)
- PointCG: Self-supervised Point Cloud Learning via Joint Completion and Generation [32.04698431036215]
In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework.
We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations.
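A generic sketch of the masked point modeling half (grouping here is random rather than FPS + k-NN, group counts and the mask ratio are assumptions, and the 3D-to-2D generation branch is omitted):

```python
import torch

def mask_point_patches(points, num_groups=64, group_size=32, mask_ratio=0.6):
    """Group a point cloud and mask a fraction of the groups; the visible
    groups are encoded and the masked ones become reconstruction targets."""
    idx = torch.randperm(points.shape[0])[: num_groups * group_size]
    groups = points[idx].view(num_groups, group_size, 3)
    masked = torch.rand(num_groups) < mask_ratio     # True = group is masked
    return groups[~masked], groups[masked]           # visible vs. targets

visible, targets = mask_point_patches(torch.randn(8192, 3))
```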
arXiv Detail & Related papers (2024-11-09T02:38:29Z)
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach exploits the potential of joint vision-language anomaly detection and achieves performance comparable to current SOTA methods across various datasets.
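A hedged sketch of the dual-image idea using the OpenAI CLIP package (the prompts and the way the language and visual cues are combined are our assumptions, not the paper's scoring system):

```python
import torch
import clip  # OpenAI CLIP package

model, preprocess = clip.load("ViT-B/32", device="cpu")
prompts = clip.tokenize(["a photo of a normal object",
                         "a photo of a damaged object"])  # illustrative prompts

@torch.no_grad()
def dual_image_scores(img_a, img_b):
    """Score each image jointly: a language cue from text prompts plus a
    visual cue that treats the paired image as a reference. Returns one
    anomaly score per image."""
    feats = model.encode_image(torch.stack([preprocess(img_a), preprocess(img_b)]))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    text = model.encode_text(prompts)
    text = text / text.norm(dim=-1, keepdim=True)
    lang = (feats @ text.T).softmax(dim=-1)[:, 1]  # language cue: P("damaged")
    vis = 1.0 - feats[0] @ feats[1]                # visual cue: pair dissimilarity
    return lang + vis
```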
arXiv Detail & Related papers (2024-05-08T03:13:20Z)
- Exploiting Point-Wise Attention in 6D Object Pose Estimation Based on Bidirectional Prediction [22.894810893732416]
The paper proposes a bidirectional correspondence prediction network with a point-wise attention-aware mechanism.
Our key insight is that the correlations between each model point and scene point provide essential information for learning point-pair matches.
Experimental results on the public datasets of LineMOD, YCB-Video, and Occ-LineMOD show that the proposed method achieves better performance than other state-of-the-art methods.
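A sketch of the core idea of correlating every model point with every scene point and reading off soft correspondences in both directions (the paper's attention module and feature extractors are not reproduced here):

```python
import torch
import torch.nn.functional as F

def bidirectional_matches(model_feats, scene_feats, model_xyz, scene_xyz):
    """Soft point-pair matches in both directions from a point-wise
    correlation map between model and scene descriptors."""
    attn = model_feats @ scene_feats.T              # (M, N) correlations
    pred_scene = attn.softmax(dim=1) @ scene_xyz    # model -> scene (M, 3)
    pred_model = attn.softmax(dim=0).T @ model_xyz  # scene -> model (N, 3)
    return pred_scene, pred_model

M, N, C = 512, 2048, 64
f_m = F.normalize(torch.randn(M, C), dim=1)         # per-point descriptors
f_s = F.normalize(torch.randn(N, C), dim=1)
pred_s, pred_m = bidirectional_matches(f_m, f_s, torch.randn(M, 3), torch.randn(N, 3))
```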
arXiv Detail & Related papers (2023-08-16T17:13:45Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
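For context, the standard spherical range-view projection that such methods start from looks roughly as follows (the FOV values are typical for a 64-beam sensor, not taken from the paper); the final comment marks where the "many-to-one" mapping arises:

```python
import numpy as np

def range_view_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project a LiDAR cloud onto an (h, w) spherical range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1) + 1e-8
    yaw, pitch = np.arctan2(y, x), np.arcsin(z / r)
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    u = (0.5 * (1.0 - yaw / np.pi) * w).astype(np.int32)
    v = ((fu - pitch) / (fu - fd) * h).astype(np.int32)
    u, v = np.clip(u, 0, w - 1), np.clip(v, 0, h - 1)
    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = r   # "many-to-one": points sharing a cell overwrite each other
    return img
```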
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
- Line Graph Contrastive Learning for Link Prediction [4.876567687745239]
We propose a Line Graph Contrastive Learning (LGCL) method to obtain multiview information.
In experiments on six public datasets, LGCL outperforms current baselines on link prediction tasks.
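The line-graph reformulation itself is easy to illustrate (this shows only the graph transformation; LGCL's contrastive views and encoder are not reproduced):

```python
import networkx as nx

# In the line graph L(G), every edge of G becomes a node, and two such nodes
# are adjacent iff the corresponding edges share an endpoint. A link-prediction
# query on G therefore becomes a node-level query on L(G).
G = nx.karate_club_graph()
LG = nx.line_graph(G)

assert LG.number_of_nodes() == G.number_of_edges()
print(sorted(LG.neighbors((0, 1)))[:5])  # edges of G incident to node 0 or node 1
```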
arXiv Detail & Related papers (2022-10-25T06:57:00Z)
- Weakly Supervised Video Salient Object Detection via Point Supervision [18.952253968878356]
We propose a strong baseline model based on point supervision.
To infer saliency maps with temporal information, we mine inter-frame complementary information from short-term and long-term perspectives.
We label two point-supervised datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and DAVSOD datasets.
arXiv Detail & Related papers (2022-07-15T03:31:15Z)
- Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework [59.578339075658995]
We propose a purely point-based framework for joint crowd counting and individual localization.
We design an intuitive solution under this framework, called the Point to Point Network (P2PNet).
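A sketch of the output stage such a purely point-based framework implies, where counting and localization fall out of the same set of point proposals (a generic illustration, not P2PNet itself):

```python
import torch

def count_from_points(scores, coords, threshold=0.5):
    """Keep point proposals above a confidence threshold: the count is the
    number of surviving points; each point localizes one individual."""
    keep = scores > threshold
    return int(keep.sum()), coords[keep]

scores = torch.rand(1024)           # per-proposal confidence
coords = torch.rand(1024, 2) * 512  # per-proposal (x, y) in image space
count, locations = count_from_points(scores, coords)
```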
arXiv Detail & Related papers (2021-07-27T11:41:50Z)
- SOLD2: Self-supervised Occlusion-aware Line Description and Detection [95.8719432775724]
We introduce the first joint detection and description of line segments in a single deep network.
Our method does not require any annotated line labels and can therefore generalize to any dataset.
We evaluate our approach against previous line detection and description methods on several multi-view datasets.
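One common way to pool a descriptor for a detected segment is to sample a dense descriptor map at points along the line; a hedged sketch of that idea (SOLD2's actual matching procedure is more elaborate and is not reproduced here):

```python
import torch
import torch.nn.functional as F

def line_descriptor(desc_map, p0, p1, num_samples=8):
    """Pool one descriptor for the segment p0 -> p1 by sampling a dense
    descriptor map (1, C, H, W) at points along the line; p0 and p1 are
    in grid_sample's normalized [-1, 1] coordinates."""
    t = torch.linspace(0, 1, num_samples).view(-1, 1)
    pts = p0 * (1 - t) + p1 * t                     # (num_samples, 2)
    grid = pts.view(1, 1, num_samples, 2)
    samples = F.grid_sample(desc_map, grid, align_corners=True)  # (1, C, 1, S)
    return F.normalize(samples.mean(dim=-1).flatten(), dim=0)

desc_map = torch.randn(1, 128, 60, 80)  # dense descriptors from the network
d = line_descriptor(desc_map, torch.tensor([-0.5, -0.5]), torch.tensor([0.5, 0.5]))
```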
arXiv Detail & Related papers (2021-04-07T19:27:17Z)
- Articulation-aware Canonical Surface Mapping [54.0990446915042]
We tackle the tasks of predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and inferring the articulation and pose of the template corresponding to the input image.
Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions.
We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.
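A minimal sketch of such a consistency term: a pixel's predicted template point, re-projected through the predicted articulation and camera, should land back on that pixel (project_fn is a hypothetical stand-in for the paper's articulated projection):

```python
import torch

def csm_consistency_loss(pixels, template_points, project_fn):
    """A pixel's predicted template point, re-projected through the predicted
    articulation and camera, should land back on that pixel."""
    reprojected = project_fn(template_points)   # (N, 2) image coordinates
    return ((reprojected - pixels) ** 2).sum(dim=-1).mean()

# Toy usage with a hypothetical stand-in projection:
pixels = torch.rand(100, 2)
template_points = torch.cat([pixels, torch.rand(100, 1)], dim=1)
print(csm_consistency_loss(pixels, template_points, lambda p: p[:, :2]))
```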
arXiv Detail & Related papers (2020-04-01T17:56:45Z)