Learning Priors of Human Motion With Vision Transformers
- URL: http://arxiv.org/abs/2501.18543v1
- Date: Thu, 30 Jan 2025 18:12:11 GMT
- Title: Learning Priors of Human Motion With Vision Transformers
- Authors: Placido Falqueto, Alberto Sanfeliu, Luigi Palopoli, Daniele Fontanelli
- Abstract summary: We propose a neural architecture based on Vision Transformers (ViTs) to learn priors of human motion.
This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs).
- Score: 5.739073185982992
- Abstract: A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop is very important for different applications, such as mobility studies in urban areas or robot navigation tasks within human-populated environments. In this article, we propose a neural architecture based on Vision Transformers (ViTs) to provide this information. This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs). We describe the methodology and the proposed neural architecture, and report experimental results on a standard dataset. We show that the proposed ViT architecture improves the metrics compared to a method based on a CNN.
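The abstract gives no implementation details, so the following is a hedged sketch only: a small ViT that maps a scene image to coarse per-patch human-motion statistics. The class name MotionPriorViT, the patch size, and the two output channels (e.g. occupancy and typical speed) are illustrative assumptions, not the authors' design.
```python
# Hypothetical sketch (not the authors' code): a small ViT that maps a
# scene image to per-patch human-motion statistics, e.g. an occupancy
# probability and a typical speed for each patch.
import torch
import torch.nn as nn

class MotionPriorViT(nn.Module):                      # name is an assumption
    def __init__(self, img_size=224, patch=16, dim=256, depth=6, heads=8,
                 out_channels=2):                     # e.g. [occupancy, speed]
        super().__init__()
        self.n = img_size // patch                    # patches per side
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n * self.n, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, out_channels)      # per-patch prediction

    def forward(self, x):                             # x: (B, 3, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        feats = self.encoder(tokens + self.pos)
        maps = self.head(feats)                       # (B, N, out_channels)
        # reshape tokens back into coarse spatial maps, one per statistic
        return maps.transpose(1, 2).reshape(x.size(0), -1, self.n, self.n)

model = MotionPriorViT()
prior = model(torch.randn(1, 3, 224, 224))            # -> (1, 2, 14, 14)
```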
Related papers
- T-TAME: Trainable Attention Mechanism for Explaining Convolutional Networks and Vision Transformers [9.284740716447342]
"Black box" nature of neural networks is a barrier to adoption in applications where explainability is essential.
This paper presents T-TAME, Transformer-compatible Trainable Attention Mechanism for Explanations.
The proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network.
arXiv Detail & Related papers (2024-03-07T14:25:03Z)
- Convolutional Initialization for Data-Efficient Vision Transformers [38.63299194992718]
Training vision transformer networks on small datasets poses challenges.
CNNs can achieve state-of-the-art performance by leveraging their architectural inductive bias.
Our approach is motivated by the finding that random impulse filters can achieve performance almost comparable to that of learned filters in CNNs.
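A random impulse filter is a kernel that is zero everywhere except for a single, randomly placed spike. The sketch below is a hedged illustration of such an initialization; the helper name, the unit magnitude, and the random sign are assumptions, not the paper's exact recipe.
```python
# Hypothetical sketch: initialize a Conv2d with random "impulse" filters,
# i.e. kernels that are zero except for one randomly placed nonzero entry.
import torch
import torch.nn as nn

def impulse_init_(conv: nn.Conv2d) -> None:           # helper name is assumed
    out_c, in_c, kh, kw = conv.weight.shape
    with torch.no_grad():
        conv.weight.zero_()
        for o in range(out_c):
            for i in range(in_c):
                y = torch.randint(kh, (1,)).item()
                x = torch.randint(kw, (1,)).item()
                # random sign; unit magnitude is an arbitrary choice here
                conv.weight[o, i, y, x] = 1.0 if torch.rand(1).item() < 0.5 else -1.0

conv = nn.Conv2d(3, 64, kernel_size=7, padding=3)
impulse_init_(conv)                                   # filters are now impulses
```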
arXiv Detail & Related papers (2024-01-23T06:03:16Z)
- Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) provide the architectural models for semantic segmentation.
While ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Comparison Analysis of Traditional Machine Learning and Deep Learning Techniques for Data and Image Classification [62.997667081978825]
The purpose of the study is to analyse and compare the most common machine learning and deep learning techniques used for computer vision 2D object classification tasks.
Firstly, we present the theoretical background of the Bag of Visual Words model and Deep Convolutional Neural Networks (DCNNs).
Secondly, we implement a Bag of Visual Words model and the VGG16 CNN architecture.
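As a hedged illustration of the Bag of Visual Words pipeline named above (not the study's code), the sketch below clusters local descriptors into a visual vocabulary with k-means and represents an image as a histogram of codeword assignments; the choice of ORB descriptors and the vocabulary size are assumptions.
```python
# Hypothetical sketch of a Bag of Visual Words pipeline: local descriptors
# -> k-means vocabulary -> per-image histogram of visual words.
import cv2
import numpy as np
from sklearn.cluster import KMeans

orb = cv2.ORB_create()                                # descriptor choice is arbitrary

def descriptors(img_gray):
    _, des = orb.detectAndCompute(img_gray, None)
    return des if des is not None else np.empty((0, 32), np.uint8)

def fit_vocabulary(images, k=100):                    # vocabulary size is assumed
    stack = np.vstack([descriptors(im) for im in images]).astype(np.float32)
    return KMeans(n_clusters=k, n_init=10).fit(stack)

def bovw_histogram(img_gray, vocab):
    des = descriptors(img_gray).astype(np.float32)
    words = vocab.predict(des)                        # nearest codeword per descriptor
    hist, _ = np.histogram(words, bins=vocab.n_clusters,
                           range=(0, vocab.n_clusters))
    return hist / max(hist.sum(), 1)                  # L1-normalized image signature
```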
arXiv Detail & Related papers (2022-04-11T11:34:43Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- Searching the Search Space of Vision Transformer [98.96601221383209]
Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection.
We propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space.
We provide design guidelines for general vision transformers, with extensive analysis based on the space-searching process.
arXiv Detail & Related papers (2021-11-29T17:26:07Z)
- Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
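Layer-wise comparisons of this kind are commonly quantified with centered kernel alignment (CKA); the sketch below is a standard linear-CKA computation, offered as an assumed analysis tool rather than code from the paper.
```python
# Hypothetical sketch: linear CKA between two activation matrices,
# rows = the same n examples, columns = features of two layers.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    x = x - x.mean(axis=0)                            # center each feature
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2       # ||X^T Y||_F^2
    denom = (np.linalg.norm(x.T @ x, "fro") *
             np.linalg.norm(y.T @ y, "fro"))
    return float(cross / denom)

# e.g. compare a ViT block's tokens with a CNN stage's (flattened) features
sim = linear_cka(np.random.randn(512, 768), np.random.randn(512, 256))
```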
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional Neural Networks (CNNs).
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
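As a hedged sketch only, one plausible form of such a self-supervised task is a dense relative-localization objective, in which a small MLP predicts the 2D grid offset between randomly sampled pairs of a ViT's output tokens; the function name, pair count, and L1 loss below are assumptions, not the paper's exact recipe.
```python
# Hypothetical sketch of a dense relative-localization style auxiliary loss:
# predict the 2D grid offset between random pairs of a ViT's output tokens.
import torch
import torch.nn as nn

def relative_localization_loss(tokens, grid, mlp, pairs=32):
    # tokens: (B, N, D) output embeddings, with N = grid * grid patches
    B, N, D = tokens.shape
    i = torch.randint(N, (B, pairs))
    j = torch.randint(N, (B, pairs))
    ti = torch.gather(tokens, 1, i.unsqueeze(-1).expand(-1, -1, D))
    tj = torch.gather(tokens, 1, j.unsqueeze(-1).expand(-1, -1, D))
    pred = mlp(torch.cat([ti, tj], dim=-1))           # (B, pairs, 2)
    # true (dy, dx) offsets between the sampled patch positions
    target = torch.stack([(i // grid - j // grid),
                          (i % grid - j % grid)], dim=-1).float()
    return nn.functional.l1_loss(pred, target)

mlp = nn.Sequential(nn.Linear(2 * 384, 256), nn.ReLU(), nn.Linear(256, 2))
loss = relative_localization_loss(torch.randn(4, 196, 384), 14, mlp)
```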
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- Camera-Based Adaptive Trajectory Guidance via Neural Networks [6.2843107854856965]
We introduce a novel method to capture visual trajectories for navigating an indoor robot in dynamic settings using streaming image data.
The captured trajectories are used to design, train, and compare two neural network architectures for predicting acceleration and steering commands for a line following robot over a continuous space in real time.
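A network of the kind described, regressing continuous acceleration and steering commands from camera frames, might look like the hedged sketch below; the layer sizes, the tanh-bounded two-output head, and the name SteeringNet are illustrative assumptions rather than the paper's architectures.
```python
# Hypothetical sketch: a small CNN that regresses continuous
# [acceleration, steering] commands from a single camera frame.
import torch
import torch.nn as nn

class SteeringNet(nn.Module):                         # name/sizes are assumptions
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, 2)                  # [acceleration, steering]

    def forward(self, frame):                         # frame: (B, 3, H, W)
        return torch.tanh(self.head(self.features(frame)))  # commands in [-1, 1]

cmd = SteeringNet()(torch.randn(1, 3, 120, 160))      # streaming frame -> (1, 2)
```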
arXiv Detail & Related papers (2020-01-09T20:05:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.