Multimodal perception for dexterous manipulation
- URL: http://arxiv.org/abs/2112.14298v1
- Date: Tue, 28 Dec 2021 21:20:26 GMT
- Title: Multimodal perception for dexterous manipulation
- Authors: Guanqun Cao and Shan Luo
- Abstract summary: We propose a cross-modal sensory data generation framework for the translation between vision and touch.
We propose a spatio-temporal attention model for tactile texture recognition, which takes both spatial features and the time dimension into consideration.
- Score: 14.314776558032166
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Humans usually perceive the world in a multimodal way, in which
vision, touch and sound are utilised to understand the surroundings from
various dimensions. These senses are combined to achieve a synergistic effect
where learning is more effective than using each sense separately. For
robotics, vision and touch are two key senses for dexterous manipulation.
Vision usually gives us apparent features such as shape and color, while touch
provides local information such as friction and texture. Due to the
complementary properties of the visual and tactile senses, it is desirable to
combine vision and touch for synergistic perception and manipulation. Much
research has investigated multimodal perception, including cross-modal
learning, 3D reconstruction, and multimodal translation with vision and touch.
Specifically, we propose a cross-modal sensory data generation framework for
the translation between vision and touch, which is able to generate realistic
pseudo data. Using this cross-modal translation method, we can make up for
inaccessible data and learn an object's properties from different views.
Recently, the attention mechanism has become a popular method in both visual
and tactile perception. We propose a spatio-temporal attention model for
tactile texture recognition, which takes both spatial features and the time
dimension into consideration. Our proposed method not only pays attention to
the salient features in each spatial feature map, but also models the temporal
correlation through time. The clear improvement demonstrates the effectiveness
of our selective attention mechanism. The spatio-temporal attention method has
potential in many applications such as grasping, recognition, and multimodal
perception.
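The two proposed methods lend themselves to small illustrative sketches. The first is a minimal vision-to-touch translation network: an encoder-decoder trained with an L1 reconstruction loss on paired visual and tactile patches. The architecture, image sizes, and the class name `VisionToTouchGenerator` are assumptions for illustration; the framework in the paper may differ (for example, by using adversarial training).

```python
# Minimal sketch of a vision-to-touch translation network (illustrative only).
# The encoder-decoder layout, image sizes, and L1 reconstruction loss are
# assumptions; the paper's framework may differ (e.g. adversarial training).
import torch
import torch.nn as nn

class VisionToTouchGenerator(nn.Module):
    """Translate a visual patch (3xHxW) into a pseudo tactile image (1xHxW)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # H -> H/2
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # H/2 -> H/4
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # H/4 -> H/8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, visual):
        return self.decoder(self.encoder(visual))

if __name__ == "__main__":
    gen = VisionToTouchGenerator()
    visual = torch.rand(8, 3, 64, 64)      # batch of visual patches
    real_touch = torch.rand(8, 1, 64, 64)  # paired tactile images
    pseudo_touch = gen(visual)
    loss = nn.functional.l1_loss(pseudo_touch, real_touch)  # reconstruction objective
    loss.backward()
    print(pseudo_touch.shape, loss.item())
```

The second is a sketch of spatio-temporal attention for tactile texture recognition: a small CNN encodes each tactile frame, a learned spatial attention map pools the salient locations within each frame, and self-attention across the frame sequence models the temporal correlation. The frame encoder, attention sizes, and the number of texture classes are assumptions, not the authors' implementation.

```python
# Minimal sketch of spatio-temporal attention over a tactile image sequence
# (illustrative; the frame encoder, layer sizes, and class count are assumptions).
import torch
import torch.nn as nn

class SpatioTemporalAttentionClassifier(nn.Module):
    def __init__(self, num_classes=20, feat_dim=64):
        super().__init__()
        self.frame_encoder = nn.Sequential(                 # per-frame CNN features
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.spatial_scorer = nn.Conv2d(feat_dim, 1, 1)      # attention over locations
        self.temporal_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):                               # frames: (B, T, 1, H, W)
        B, T, C, H, W = frames.shape
        feats = self.frame_encoder(frames.reshape(B * T, C, H, W))      # (B*T, D, h, w)
        scores = self.spatial_scorer(feats).flatten(2).softmax(dim=-1)  # (B*T, 1, h*w)
        pooled = (feats.flatten(2) * scores).sum(-1)                    # attended spatial pooling
        seq = pooled.reshape(B, T, -1)                                  # (B, T, D)
        attended, _ = self.temporal_attn(seq, seq, seq)                 # correlation across time
        return self.classifier(attended.mean(dim=1))

if __name__ == "__main__":
    model = SpatioTemporalAttentionClassifier()
    logits = model(torch.rand(2, 8, 1, 32, 32))   # 2 sequences of 8 tactile frames
    print(logits.shape)                           # torch.Size([2, 20])
```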
Related papers
- Self-supervised Spatio-Temporal Graph Mask-Passing Attention Network for Perceptual Importance Prediction of Multi-point Tactility [8.077951761948556]
We develop a model to predict tactile perceptual importance at multiple points, based on self-supervised learning and a Spatio-Temporal Graph Neural Network.
Results indicate that this model can effectively predict the perceptual importance of various points in multi-point tactile perception scenarios.
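A single graph-attention message-passing layer over tactile contact points gives the flavour of the spatial part of such a model; the feature sizes, the adjacency, and the single-head formulation are assumptions for illustration, and the temporal dimension is omitted.

```python
# Minimal sketch of attention-based message passing over a graph of tactile
# contact points (spatial part only; sizes and adjacency are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, in_dim) node features, adj: (N, N) 0/1 adjacency with self-loops
        h = self.proj(x)                                     # (N, out_dim)
        N = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                           h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)  # (N, N) raw attention
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)                  # attend only over neighbours
        return weights @ h                                   # aggregated messages

if __name__ == "__main__":
    layer = GraphAttentionLayer(in_dim=8, out_dim=16)
    x = torch.randn(5, 8)                      # 5 tactile contact points
    adj = torch.eye(5) + torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
    print(layer(x, adj).shape)                 # chain graph -> torch.Size([5, 16])
```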
arXiv Detail & Related papers (2024-10-04T13:45:50Z)
- Emotion Recognition from the perspective of Activity Recognition [0.0]
Appraising human emotional states, behaviors, and reactions displayed in real-world settings can be accomplished using latent continuous dimensions.
For emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in real-world settings.
We propose a novel three-stream end-to-end deep learning regression pipeline with an attention mechanism.
arXiv Detail & Related papers (2024-03-24T18:53:57Z)
- Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training [0.850206009406913]
MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By drawing on both sensory inputs, MViTac leverages intra- and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction.
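A minimal sketch of how intra- and inter-modality contrastive objectives can be combined, assuming InfoNCE losses over paired, already-encoded visual and tactile embeddings; the temperature, the use of two augmented views per modality, and the function names are assumptions rather than the MViTac implementation.

```python
# Minimal sketch of intra- and inter-modality contrastive (InfoNCE) losses for
# paired visual/tactile embeddings (illustrative; not the MViTac implementation).
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """Standard InfoNCE: matching (anchor_i, positive_i) pairs are the positives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def visual_tactile_loss(vis_a, vis_b, tac_a, tac_b):
    """vis_* and tac_* are embeddings of two augmented views per modality."""
    intra = info_nce(vis_a, vis_b) + info_nce(tac_a, tac_b)  # within each modality
    inter = info_nce(vis_a, tac_a) + info_nce(tac_a, vis_a)  # across modalities
    return intra + inter

if __name__ == "__main__":
    B, D = 16, 128
    loss = visual_tactile_loss(torch.randn(B, D), torch.randn(B, D),
                               torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```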
arXiv Detail & Related papers (2024-01-22T15:11:57Z)
- Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation [57.60490773016364]
We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation.
Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem.
Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation.
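The idea of learning a neural field online while tracking the object can be illustrated by jointly optimizing a small signed-distance network and a pose so that observed surface points lie on the zero level set. This is only a sketch under assumptions (an MLP SDF, an axis-angle pose, random stand-in points); the actual NeuralFeels system also uses free-space samples, depth rendering, and a pose graph, which are omitted here.

```python
# Minimal sketch of jointly optimizing a neural SDF and an object pose from
# surface points (illustrative; not the NeuralFeels implementation).
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):                      # x: (N, 3) points in the object frame
        return self.net(x).squeeze(-1)

def axis_angle_to_matrix(w):                   # w: (3,) axis-angle vector
    wx, wy, wz = w
    zero = torch.zeros_like(wx)
    K = torch.stack([torch.stack([zero, -wz, wy]),
                     torch.stack([wz, zero, -wx]),
                     torch.stack([-wy, wx, zero])])
    return torch.matrix_exp(K)                 # rotation matrix via the exponential map

sdf = SDFNet()
rot, trans = nn.Parameter(torch.zeros(3)), nn.Parameter(torch.zeros(3))
opt = torch.optim.Adam(list(sdf.parameters()) + [rot, trans], lr=1e-3)

surface_pts = torch.rand(256, 3)               # stand-in for fused camera/tactile points
for _ in range(100):
    opt.zero_grad()
    obj_pts = (surface_pts - trans) @ axis_angle_to_matrix(rot)  # world -> object frame
    loss = sdf(obj_pts).abs().mean()           # observed surface should sit on the zero level set
    loss.backward()                            # (real systems add free-space and regularization terms)
    opt.step()
print(loss.item())
```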
arXiv Detail & Related papers (2023-12-20T22:36:37Z)
- The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning [60.91637862768949]
We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding.
We evaluate M3L on three simulated environments with both visual and tactile observations.
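Masked autoencoding over concatenated visual and tactile tokens can be sketched as follows, assuming patch embeddings are already computed; the mask ratio, the transformer configuration, and the omission of the reinforcement-learning policy are simplifications, not the authors' code.

```python
# Minimal sketch of masked autoencoding over joint visual and tactile tokens
# (illustrative; the RL policy that M3L trains alongside this is omitted).
import torch
import torch.nn as nn

class MaskedMultimodalAE(nn.Module):
    def __init__(self, dim=64, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.modality_embed = nn.Embedding(2, dim)             # 0 = vision, 1 = touch
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)                     # predict the original tokens

    def forward(self, vis_tokens, tac_tokens):
        # vis_tokens: (B, Nv, D), tac_tokens: (B, Nt, D) -- already patch-embedded
        tokens = torch.cat([vis_tokens, tac_tokens], dim=1)
        mod_ids = torch.cat([torch.zeros(vis_tokens.size(1)),
                             torch.ones(tac_tokens.size(1))]).long().to(tokens.device)
        tokens = tokens + self.modality_embed(mod_ids)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        masked = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        latent = self.encoder(masked)                          # shared visual-tactile representation
        recon = self.decoder(latent)
        loss = ((recon - tokens) ** 2)[mask].mean()            # reconstruct masked positions only
        return latent, loss

if __name__ == "__main__":
    model = MaskedMultimodalAE()
    latent, loss = model(torch.randn(4, 16, 64), torch.randn(4, 4, 64))
    loss.backward()
    print(latent.shape, loss.item())
```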
arXiv Detail & Related papers (2023-11-02T01:33:00Z)
- Tactile-Filter: Interactive Tactile Perception for Part Mating [54.46221808805662]
Humans rely on touch and tactile sensing for many dexterous manipulation tasks.
Vision-based tactile sensors are widely used for various robotic perception and control tasks.
We present a method for interactive perception using vision-based tactile sensors for a part mating task.
arXiv Detail & Related papers (2023-03-10T16:27:37Z)
- See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation [49.925499720323806]
We study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks.
We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor.
arXiv Detail & Related papers (2022-12-07T18:55:53Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, covering vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- Perception Over Time: Temporal Dynamics for Robust Image Understanding [5.584060970507506]
Deep learning surpasses human-level performance in narrow and specific vision tasks.
Human visual perception is orders of magnitude more robust to changes in the input stimulus.
We introduce a novel method of incorporating temporal dynamics into static image understanding.
arXiv Detail & Related papers (2022-03-11T21:11:59Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.