Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic Grasping
- URL: http://arxiv.org/abs/2404.06277v2
- Date: Mon, 8 Jul 2024 15:29:27 GMT
- Title: Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic Grasping
- Authors: Anas Gouda, Max Schwarz, Christopher Reining, Sven Behnke, Alice Kirchheim,
- Abstract summary: Foundation models are a strong trend in deep learning and computer vision.
Here, we focus on training such an object identification model.
Key solution to train such a model is the centroid triplet loss (CTL), which aggregates image features to their centroids.
- Score: 14.958823096408175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models are a strong trend in deep learning and computer vision. These models serve as a base for applications as they require minor or no further fine-tuning by developers to integrate into their applications. Foundation models for zero-shot object segmentation such as Segment Anything (SAM) output segmentation masks from images without any further object information. When they are followed in a pipeline by an object identification model, they can perform object detection without training. Here, we focus on training such an object identification model. A crucial practical aspect for an object identification model is to be flexible in input size. As object identification is an image retrieval problem, a suitable method should handle multi-query multi-gallery situations without constraining the number of input images (e.g. by having fixed-size aggregation layers). The key solution to train such a model is the centroid triplet loss (CTL), which aggregates image features to their centroids. CTL yields high accuracy, avoids misleading training signals and keeps the model input size flexible. In our experiments, we establish a new state of the art on the ArmBench object identification task, which shows general applicability of our model. We furthermore demonstrate an integrated unseen object detection pipeline on the challenging HOPE dataset, which requires fine-grained detection. There, our pipeline matches and surpasses related methods which have been trained on dataset-specific data.
Related papers
- Leveraging Foundation Models To learn the shape of semi-fluid deformable objects [0.7895162173260983]
A keen interest was manifested by researchers in the last decade to characterize and manipulate deformable objects of non-fluid nature.
In this paper, we address the subject of characterizing weld pool to define stable features that serve as information for motion control objectives.
The performance of knowledge distillation from foundation models into a smaller generative model shows prominent results in the characterization of deformable objects.
arXiv Detail & Related papers (2024-11-25T13:41:35Z) - MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - GS-Pose: Category-Level Object Pose Estimation via Geometric and
Semantic Correspondence [5.500735640045456]
Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics.
We propose to utilize both geometric and semantic features obtained from a pre-trained foundation model.
This requires significantly less data to train than prior methods since the semantic features are robust to object texture and appearance.
arXiv Detail & Related papers (2023-11-23T02:35:38Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - DR-WLC: Dimensionality Reduction cognition for object detection and pose
estimation by Watching, Learning and Checking [30.58114448119465]
Existing object detection and pose estimation methods mostly adopt the same-dimensional data for training.
DR-WLC, a dimensionality reduction cognitive model, can perform both object detection and pose estimation tasks at the same time.
arXiv Detail & Related papers (2023-01-17T15:08:32Z) - MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
We present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects.
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z) - Dynamic Relevance Learning for Few-Shot Object Detection [6.550840743803705]
We propose a dynamic relevance learning model, which utilizes the relationship between all support images and Region of Interest (RoI) on the query images to construct a dynamic graph convolutional network (GCN)
The proposed model achieves the best overall performance, which shows its effectiveness of learning more generalized features.
arXiv Detail & Related papers (2021-08-04T18:29:42Z) - Rectifying the Shortcut Learning of Background: Shared Object
Concentration for Few-Shot Image Recognition [101.59989523028264]
Few-Shot image classification aims to utilize pretrained knowledge learned from a large-scale dataset to tackle a series of downstream classification tasks.
We propose COSOC, a novel Few-Shot Learning framework, to automatically figure out foreground objects at both pretraining and evaluation stage.
arXiv Detail & Related papers (2021-07-16T07:46:41Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - Few-Cost Salient Object Detection with Adversarial-Paced Learning [95.0220555274653]
This paper proposes to learn the effective salient object detection model based on the manual annotation on a few training images only.
We name this task as the few-cost salient object detection and propose an adversarial-paced learning (APL)-based framework to facilitate the few-cost learning scenario.
arXiv Detail & Related papers (2021-04-05T14:15:49Z) - Few-shot Object Detection on Remote Sensing Images [11.40135025181393]
We introduce a few-shot learning-based method for object detection on remote sensing images.
We build our few-shot object detection model upon YOLOv3 architecture and develop a multi-scale object detection framework.
arXiv Detail & Related papers (2020-06-14T07:18:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.