X-Capture: An Open-Source Portable Device for Multi-Sensory Learning
- URL: http://arxiv.org/abs/2504.02318v1
- Date: Thu, 03 Apr 2025 06:44:25 GMT
- Title: X-Capture: An Open-Source Portable Device for Multi-Sensory Learning
- Authors: Samuel Clarke, Suzannah Wistreich, Yanjie Ze, Jiajun Wu
- Abstract summary: We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets. X-Capture lays the groundwork for advancing human-like sensory representations in AI.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.
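The abstract evaluates multi-modal representations on object-centric tasks such as cross-sensory retrieval. As an illustration only (this is not the authors' implementation), such retrieval is commonly framed as nearest-neighbor search in a shared embedding space; the function names and toy embeddings below are hypothetical:

```python
import numpy as np

def normalize(x):
    # L2-normalize rows so cosine similarity becomes a plain dot product
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cross_sensory_retrieval(query_emb, gallery_emb):
    """Rank gallery items by cosine similarity to each query.

    query_emb:   (n, d) embeddings from one modality (e.g. impact audio)
    gallery_emb: (m, d) embeddings from another modality (e.g. RGBD vision)
    Returns an (n, m) array of gallery indices, best match first.
    """
    sims = normalize(query_emb) @ normalize(gallery_emb).T
    return np.argsort(-sims, axis=1)

# Toy data: three objects in a 4-D shared space; the audio embeddings are
# noisy copies of the vision embeddings, mimicking an aligned encoder pair.
rng = np.random.default_rng(0)
vision = np.eye(3, 4)                                # unit basis vectors
audio = vision + 0.05 * rng.standard_normal((3, 4))  # small perturbation

ranks = cross_sensory_retrieval(audio, vision)
print(ranks[:, 0])  # top-1 vision match for each audio query: [0 1 2]
```

In practice the two encoders would be trained (e.g. contrastively) so that embeddings of the same object align across modalities; the retrieval step itself stays this simple.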
Related papers
- 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing [18.189782619503074]
This paper introduces 3D-ViTac, a multi-modal sensing and learning system for robots.
Our system features tactile sensors equipped with dense sensing units, each covering an area of 3 mm².
We show that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies.
arXiv Detail & Related papers (2024-10-31T16:22:53Z) - X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing [14.549639729808717]
Current human sensing primarily depends on cameras and LiDAR, each of which has its own strengths and limitations. Existing multi-modal fusion solutions are typically designed for fixed modality combinations. We propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue.
arXiv Detail & Related papers (2024-10-14T05:23:12Z) - MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition [2.7532797256542403]
Human activity recognition (HAR) is a long-standing problem in artificial intelligence with applications in a broad range of areas. We introduce our comprehensive Fitness Multimodal Activity dataset (FiMAD) to enhance HAR performance across various modalities. We show that FiMAD can increase performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH.
arXiv Detail & Related papers (2024-06-06T08:42:36Z) - MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in
3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data points.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z) - Aria-NeRF: Multimodal Egocentric View Synthesis [17.0554791846124]
We seek to accelerate research in developing rich, multimodal scene models trained from egocentric data, based on differentiable volumetric ray-tracing inspired by Neural Radiance Fields (NeRFs).
This dataset offers a comprehensive collection of sensory data, featuring RGB images, eye-tracking camera footage, audio recordings from a microphone, atmospheric pressure readings from a barometer, positional coordinates from GPS, and information from dual-frequency IMU datasets (1kHz and 800Hz).
The diverse data modalities and the real-world context captured within this dataset serve as a robust foundation for furthering our understanding of human behavior and enabling more immersive and intelligent experiences in …
arXiv Detail & Related papers (2023-11-11T01:56:35Z) - RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in
One-Shot [56.130215236125224]
A key challenge in robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots.
Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations.
This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception.
arXiv Detail & Related papers (2023-07-02T15:33:31Z) - mRI: Multi-modal 3D Human Pose Estimation Dataset using mmWave, RGB-D,
and Inertial Sensors [6.955796938573367]
We present mRI, a multi-modal 3D human pose estimation dataset with mmWave, RGB-D, and Inertial Sensors.
Our dataset consists of over 160k synchronized frames from 20 subjects performing rehabilitation exercises.
arXiv Detail & Related papers (2022-10-15T23:08:44Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Scene-Aware
Ambidextrous Bin Picking via Physics-based Metaverse Synthesis [72.85526892440251]
We introduce MetaGraspNet, a large-scale photo-realistic bin picking dataset constructed via physics-based metaverse synthesis.
The proposed dataset contains 217k RGBD images across 82 different article types, with full annotations for object detection, amodal perception, keypoint detection, manipulation order and ambidextrous grasp labels for a parallel-jaw and vacuum gripper.
We also provide a real dataset consisting of over 2.3k fully annotated high-quality RGBD images, divided into 5 levels of difficulties and an unseen object set to evaluate different object and layout properties.
arXiv Detail & Related papers (2022-08-08T08:15:34Z) - ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and
Tactile Representations [52.226947570070784]
We present ObjectFolder, a dataset of 100 objects that addresses both challenges with two key innovations.
First, ObjectFolder encodes the visual, auditory, and tactile sensory data for all objects, enabling a number of multisensory object recognition tasks.
Second, ObjectFolder employs a uniform, object-centric, and implicit representation for each object's visual textures, acoustic simulations, and tactile readings, making the dataset flexible to use and easy to share.
arXiv Detail & Related papers (2021-09-16T14:00:59Z) - One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z) - Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision
Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)