Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos
- URL: http://arxiv.org/abs/2505.18899v1
- Date: Sat, 24 May 2025 23:12:23 GMT
- Title: Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos
- Authors: Andrea Ramazzina, Vittorio Giammarino, Matteo El-Hariry, Mario Bijelic,
- Abstract summary: Imitation from videos often fails when expert demonstrations and learner environments exhibit domain shifts. We propose a different approach: instead of randomizing appearances, we eliminate their influence entirely by rethinking the sensory representation itself. Our method converts standard RGB videos into a sparse, event-based representation that encodes temporal intensity gradients.
- Score: 4.338232204525725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imitation from videos often fails when expert demonstrations and learner environments exhibit domain shifts, such as discrepancies in lighting, color, or texture. While visual randomization partially addresses this problem by augmenting training data, it remains computationally intensive and inherently reactive, struggling with unseen scenarios. We propose a different approach: instead of randomizing appearances, we eliminate their influence entirely by rethinking the sensory representation itself. Inspired by biological vision systems that prioritize temporal transients (e.g., retinal ganglion cells) and by recent sensor advancements, we introduce event-inspired perception for visually robust imitation. Our method converts standard RGB videos into a sparse, event-based representation that encodes temporal intensity gradients, discarding static appearance features. This biologically grounded approach disentangles motion dynamics from visual style, enabling robust visual imitation from observations even in the presence of visual mismatches between expert and agent environments. By training policies on event streams, we achieve invariance to appearance-based distractors without requiring computationally expensive and environment-specific data augmentation techniques. Experiments across the DeepMind Control Suite and the Adroit platform for dynamic dexterous manipulation show the efficacy of our method. Our code is publicly available at Eb-LAIfO.
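The core step the abstract describes, converting consecutive RGB frames into a sparse, event-like map of temporal intensity gradients, can be illustrated with a minimal sketch. The thresholded log-intensity difference below is an assumption chosen for illustration; the function name and threshold are hypothetical and this is not the released Eb-LAIfO implementation.

```python
# Minimal sketch: event-inspired conversion of two consecutive RGB frames into
# a {-1, 0, +1} polarity map of temporal intensity changes. The log-intensity
# difference and the contrast threshold are illustrative assumptions, not the
# paper's released code.
import numpy as np

def rgb_to_events(prev_frame: np.ndarray, curr_frame: np.ndarray,
                  threshold: float = 0.1) -> np.ndarray:
    """prev_frame, curr_frame: uint8 RGB arrays of shape (H, W, 3)."""
    # Grayscale intensity in [0, 1].
    prev_gray = prev_frame.astype(np.float32).mean(axis=-1) / 255.0
    curr_gray = curr_frame.astype(np.float32).mean(axis=-1) / 255.0

    # Temporal gradient of log intensity (epsilon avoids log(0)).
    eps = 1e-3
    delta = np.log(curr_gray + eps) - np.log(prev_gray + eps)

    # Keep only changes above threshold; static appearance is discarded.
    events = np.zeros_like(delta, dtype=np.int8)
    events[delta > threshold] = 1     # brightening (positive polarity)
    events[delta < -threshold] = -1   # darkening (negative polarity)
    return events
```

A policy would then be trained on stacks of such event maps rather than raw RGB frames, so lighting, color, or texture changes that leave the temporal gradients unchanged do not alter the observation.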
Related papers
- FlashGuard: Novel Method in Evaluating Differential Characteristics of Visual Stimuli for Deterring Seizure Triggers in Photosensitive Epilepsy [0.0]
Individuals with photosensitive epilepsy (PSE) encounter challenges when using devices.
The current norm for preventing epileptic flashes in media is to detect asynchronously when a flash will occur in a video and then notify the user.
FlashGuard, a novel approach, was devised to assess the rate of change of colors in frames across the user's screen and appropriately mitigate the stimuli.
arXiv Detail & Related papers (2025-07-25T22:18:25Z) - Zero-Shot Visual Generalization in Robot Manipulation [0.13280779791485384]
Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth.
Disentangled representation learning has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts.
We demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware.
arXiv Detail & Related papers (2025-05-16T22:01:46Z) - Don't Judge by the Look: Towards Motion Coherent Video Representation [56.09346222721583]
Motion Coherent Augmentation (MCA) is a data augmentation method for video understanding.
MCA introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances.
arXiv Detail & Related papers (2024-03-14T15:53:04Z) - Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation [57.60490773016364]
We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation.
Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem.
Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation.
arXiv Detail & Related papers (2023-12-20T22:36:37Z) - What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects [63.04781030984006]
Dynamic Neural Radiance Field (NeRF) is a powerful algorithm capable of rendering photo-realistic novel view images from a monocular RGB video of a dynamic scene.
We address the limitation by reformulating the neural radiance field function to be conditioned on surface position and orientation in the observation space.
We evaluate our model based on the novel view synthesis quality with a self-collected dataset of different moving specular objects in realistic environments.
arXiv Detail & Related papers (2023-03-25T11:03:53Z) - Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream.
The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations.
Our experiments leverage 3D virtual environments and they show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z) - FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations [8.194624568473126]
Face forgery can be applied to any target, which poses a new threat to personal privacy and property security.
Inspired by the fact that the spatial coherence and temporal consistency of physiological signals are destroyed in generated content, we attempt to find inconsistent patterns that can distinguish between real and synthetic videos.
arXiv Detail & Related papers (2021-11-15T08:44:52Z) - Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild [31.40575057347465]
We tackle the task of video-based visual emotion recognition in the wild.
Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction.
We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes.
arXiv Detail & Related papers (2021-05-16T17:31:59Z) - Unsupervised Feature Learning for Manipulation with Contrastive Domain Randomization [19.474628552656764]
We show that a naive application of domain randomization to unsupervised learning does not promote invariance.
We propose a simple modification of the contrastive loss to fix this, exploiting the fact that we can control the simulated randomization of visual properties (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-03-20T09:54:45Z) - Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video [76.19076002661157]
Non-Rigid Neural Radiance Fields (NR-NeRF) is a reconstruction and novel view synthesis approach for general non-rigid dynamic scenes.
We show that even a single consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views.
arXiv Detail & Related papers (2020-12-22T18:46:12Z) - Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
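For the contrastive domain randomization entry above, the loss modification it describes, treating differently randomized renderings of the same simulated state as a positive pair, can be sketched as follows. The InfoNCE form, the function name, and the temperature are assumptions made for illustration, not that paper's exact objective.

```python
# Sketch of a contrastive loss over controlled visual randomization: two
# renderings of the same simulated state (different textures/lighting) form a
# positive pair; other states in the batch act as negatives. The InfoNCE form
# and all names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_randomization_loss(encoder, views_a, views_b, temperature=0.1):
    """views_a[i] and views_b[i] render the same state under different randomizations."""
    z_a = F.normalize(encoder(views_a), dim=-1)   # (B, D) embeddings
    z_b = F.normalize(encoder(views_b), dim=-1)   # (B, D) embeddings
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)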
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.