A Diffusion-based Data Generator for Training Object Recognition Models in Ultra-Range Distance
- URL: http://arxiv.org/abs/2404.09846v2
- Date: Sat, 23 Nov 2024 16:55:30 GMT
- Title: A Diffusion-based Data Generator for Training Object Recognition Models in Ultra-Range Distance
- Authors: Eran Bamani, Eden Nissinman, Lisa Koenigsberg, Inbar Meir, Avishai Sintov
- Abstract summary: Training a model to recognize hardly visible objects located in ultra-range requires an exhaustive collection of labeled samples.
We propose the Diffusion in Ultra-Range (DUR) framework based on a Diffusion model to generate labeled images of distant objects in various scenes.
DUR is compared to other types of generative models showcasing superiority both in fidelity and in recognition success rate when training a URGR model.
- Score: 2.240453048130742
- Abstract: Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction, where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize hardly visible objects located in ultra-range requires an exhaustive collection of a large number of labeled samples. Generating synthetic training data is a recent solution to the lack of real-world data, but existing approaches fail to properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework, based on a diffusion model, to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model on directive gestures in which fine details of the gesturing hand are challenging to distinguish. DUR is compared to other types of generative models, showing superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on the real data. The synthetic-data-based URGR model is also demonstrated in gesture-based direction of a ground robot.
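The abstract describes a generator that takes a desired distance and class and returns a labeled synthetic image. The snippet below is a minimal sketch of that conditioning pattern (a class- and distance-conditioned denoiser driven by a DDPM-style sampling loop); the network, embedding sizes, and noise schedule are illustrative assumptions, not the authors' DUR implementation.

```python
# Hypothetical sketch of a distance- and class-conditioned diffusion sampler,
# in the spirit of the DUR generator described above.
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Toy denoiser conditioned on gesture class and camera-to-user distance."""
    def __init__(self, num_classes=6, img_ch=3, hidden=64):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, hidden)
        self.dist_proj = nn.Linear(1, hidden)   # distance in metres, a scalar per sample
        self.time_proj = nn.Linear(1, hidden)   # diffusion timestep
        self.net = nn.Sequential(
            nn.Conv2d(img_ch + hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1),
        )

    def forward(self, x, t, cls, dist):
        # Combine the conditioning signals and broadcast them over the spatial dims.
        cond = self.class_emb(cls) + self.dist_proj(dist[:, None]) + self.time_proj(t[:, None])
        cond_map = cond[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        return self.net(torch.cat([x, cond_map], dim=1))

@torch.no_grad()
def sample(model, cls, dist, steps=50, size=64):
    """DDPM-style ancestral sampling: start from noise and iteratively denoise."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(cls.shape[0], 3, size, size)
    for i in reversed(range(steps)):
        t = torch.full((cls.shape[0],), float(i) / steps)
        eps = model(x, t, cls, dist)                      # predicted noise
        x = (x - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x

model = CondDenoiser()
labels = torch.tensor([2, 2, 5])              # desired gesture classes (hypothetical label set)
distances = torch.tensor([15.0, 20.0, 25.0])  # desired user distances in metres
synthetic = sample(model, labels, distances)  # labeled synthetic images for URGR training
```

In this pattern, the (distance, class) pair requested at sampling time becomes the label of the generated image, so the output can feed directly into supervised training of a recognition model.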
Related papers
- RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection [60.960988614701414]
RIGID is a training-free and model-agnostic method for robust AI-generated image detection.
RIGID significantly outperforms existing training-based and training-free detectors.
arXiv Detail & Related papers (2024-05-30T14:49:54Z) - AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents [109.3804962220498]
AutoRT is a system to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision.
We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies.
We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs enables instruction-following data-collection robots that can align with human preferences.
arXiv Detail & Related papers (2024-01-23T18:45:54Z) - Is synthetic data from generative models ready for image recognition? [69.42645602062024]
We study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks.
We showcase the strengths and shortcomings of synthetic data from existing generative models, and propose strategies for applying synthetic data more effectively to recognition tasks.
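A minimal sketch of the workflow this paper studies, using an off-the-shelf text-to-image diffusion pipeline to build a folder-per-class synthetic dataset; the model checkpoint, prompts, and class names are illustrative choices, not the paper's setup.

```python
# Hypothetical sketch: generate labeled synthetic images with a text-to-image model.
from pathlib import Path
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

classes = ["golden retriever", "tabby cat", "red fox"]   # hypothetical label set
out_dir = Path("synthetic_dataset")
for label in classes:
    (out_dir / label).mkdir(parents=True, exist_ok=True)
    for i in range(100):                                  # images per class
        image = pipe(f"a photo of a {label}", num_inference_steps=30).images[0]
        image.save(out_dir / label / f"{i:04d}.png")

# The folder-per-class layout can then be consumed by torchvision.datasets.ImageFolder
# to train a standard image classifier on the synthetic data.
```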
arXiv Detail & Related papers (2022-10-14T06:54:24Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
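The summary describes a shared multi-modal backbone feeding several task heads. The sketch below shows that general pattern only; the encoders, fusion strategy, and head designs are placeholders and not the actual MMISM architecture.

```python
# Hypothetical sketch of a multi-input / multi-head scene understanding model.
import torch
import torch.nn as nn

class MultiTaskSceneModel(nn.Module):
    def __init__(self, num_sem_classes=20, num_obj_classes=10, num_joints=17):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.lidar_enc = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())  # sparse depth rasterised to an image
        self.fuse = nn.Conv2d(64, 64, 1)
        # One head per output task.
        self.depth_head = nn.Conv2d(64, 1, 1)                  # depth completion
        self.seg_head = nn.Conv2d(64, num_sem_classes, 1)      # semantic segmentation
        self.det_head = nn.Conv2d(64, num_obj_classes + 7, 1)  # coarse per-cell 3D box parameters
        self.pose_head = nn.Conv2d(64, num_joints, 1)          # human keypoint heatmaps

    def forward(self, rgb, sparse_lidar):
        feats = self.fuse(torch.cat([self.rgb_enc(rgb), self.lidar_enc(sparse_lidar)], dim=1))
        return {
            "depth": self.depth_head(feats),
            "segmentation": self.seg_head(feats),
            "detection": self.det_head(feats),
            "pose": self.pose_head(feats),
        }

model = MultiTaskSceneModel()
out = model(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
```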
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [25.50131893785007]
This work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot.
We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion.
We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously.
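The finetuning recipe described here (a frozen pretrained backbone plus a small task-specific head) can be sketched as follows; the generic TransformerEncoder, token dimensions, and navigation head are stand-ins, not the actual PACT model.

```python
# Hypothetical sketch: freeze a pretrained sequence backbone, train only a small head.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=4
)
for p in backbone.parameters():
    p.requires_grad = False               # keep the pretrained representation fixed

nav_head = nn.Linear(128, 2)              # e.g. steering + velocity for a navigation task
optimizer = torch.optim.Adam(nav_head.parameters(), lr=1e-4)

tokens = torch.randn(8, 16, 128)          # batch of perception-action token sequences
targets = torch.randn(8, 2)
features = backbone(tokens)[:, -1]        # representation of the latest timestep
loss = nn.functional.mse_loss(nav_head(features), targets)
loss.backward()
optimizer.step()
```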
arXiv Detail & Related papers (2022-09-22T16:20:17Z) - Real-to-Sim: Predicting Residual Errors of Robotic Systems with Sparse Data using a Learning-based Unscented Kalman Filter [65.93205328894608]
We learn the residual errors between a dynamics and/or simulator model and the real robot.
We show that with the learned residual errors, we can further close the reality gap between dynamic models, simulations, and actual hardware.
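The residual-learning idea can be sketched as follows: fit a small model to the gap between the simulator's predicted next state and the real robot's measured next state, then add the predicted residual back onto the simulator output. The dimensions and the placeholder simulator are assumptions; the paper additionally fuses the learned residual through an Unscented Kalman Filter, which is not shown here.

```python
# Hypothetical sketch of residual-error learning between a simulator and a real robot.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
residual_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                             nn.Linear(64, state_dim))
opt = torch.optim.Adam(residual_net.parameters(), lr=1e-3)

def sim_step(state, action):
    # Placeholder dynamics; in practice this is the analytic or simulator model.
    return state + 0.1 * torch.cat([action, torch.zeros_like(action)], dim=-1)

# Logged transitions (state, action, real_next_state); real data is faked here for illustration.
state = torch.randn(256, state_dim)
action = torch.randn(256, action_dim)
real_next = sim_step(state, action) + 0.05 * torch.randn(256, state_dim)

target_residual = real_next - sim_step(state, action)
pred_residual = residual_net(torch.cat([state, action], dim=-1))
loss = nn.functional.mse_loss(pred_residual, target_residual)
loss.backward()
opt.step()

# Corrected prediction = simulator output + learned residual.
corrected_next = sim_step(state, action) + residual_net(torch.cat([state, action], dim=-1))
```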
arXiv Detail & Related papers (2022-09-07T15:15:12Z) - Conditional Generation of Synthetic Geospatial Images from Pixel-level and Feature-level Inputs [0.0]
We present a conditional generative model, called VAE-Info-cGAN, for synthesizing semantically rich images simultaneously conditioned on a pixel-level condition (PLC) and a feature-level condition (FLC).
The proposed model can accurately generate various forms of macroscopic aggregates across different geographic locations while conditioned only on an atemporal representation of the road network.
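The dual-conditioning pattern described here can be sketched with a generator that concatenates a pixel-level condition map (e.g., a rasterised road network) with a plane derived from the feature-level condition vector and noise; all layer sizes are placeholders and this is not the VAE-Info-cGAN design.

```python
# Hypothetical sketch of a generator with pixel-level (PLC) and feature-level (FLC) conditioning.
import torch
import torch.nn as nn

class DualCondGenerator(nn.Module):
    def __init__(self, noise_dim=32, flc_dim=8, out_ch=3):
        super().__init__()
        self.flc_proj = nn.Linear(noise_dim + flc_dim, 16 * 16)
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),   # PLC map channel + broadcast FLC/noise plane
            nn.Conv2d(32, out_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, plc_map, flc_vec, noise):
        # Turn the global (feature-level) condition and noise into a spatial plane,
        # then concatenate it with the pixel-level condition map.
        plane = self.flc_proj(torch.cat([noise, flc_vec], dim=-1)).view(-1, 1, 16, 16)
        return self.net(torch.cat([plc_map, plane], dim=1))

gen = DualCondGenerator()
img = gen(torch.rand(4, 1, 16, 16), torch.rand(4, 8), torch.randn(4, 32))
```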
arXiv Detail & Related papers (2021-09-11T06:58:19Z) - Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object, indicated verbally by a human user, from a crowded scene.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z) - Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask R-CNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
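A common torchvision recipe for adapting Mask R-CNN to a new two-class setting (background + hand) is sketched below; the dataset plumbing and training loop are omitted, and this is a generic sketch rather than the authors' exact configuration.

```python
# Hypothetical sketch: adapt a pretrained Mask R-CNN to segment a single "hand" class.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + robot hand
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box classification head for the new label set.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask prediction head for the new label set.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)

# The model can now be trained on (image, {boxes, labels, masks}) pairs for the hand class.
```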
arXiv Detail & Related papers (2021-02-09T10:34:32Z) - VAE-Info-cGAN: Generating Synthetic Images by Combining Pixel-level and Feature-level Geospatial Conditional Inputs [0.0]
We present a conditional generative model for synthesizing semantically rich images simultaneously conditioned on a pixel-level condition (PLC) and a feature-level condition (FLC).
Experiments on a GPS dataset show that the proposed model can accurately generate various forms of macroscopic aggregates across different geographic locations.
arXiv Detail & Related papers (2020-12-08T03:46:19Z) - PennSyn2Real: Training Object Recognition Models without Human Labeling [12.923677573437699]
We propose PennSyn2Real - a synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs).
The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification.
We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation.
arXiv Detail & Related papers (2020-09-22T02:53:40Z)