Zero-Shot Underwater Gesture Recognition
- URL: http://arxiv.org/abs/2407.14103v1
- Date: Fri, 19 Jul 2024 08:16:46 GMT
- Title: Zero-Shot Underwater Gesture Recognition
- Authors: Sandipan Sarma, Gundameedi Sai Ram Mohan, Hariansh Sehgal, Arijit Sur
- Abstract summary: Hand gesture recognition allows humans to interact with machines non-verbally, which has a huge application in underwater exploration using autonomous underwater vehicles.
Recently, a new gesture-based language called CADDIAN has been devised for divers, and supervised learning methods have been applied to recognize the gestures with high accuracy.
In this work, we advocate the need for zero-shot underwater gesture recognition (ZSUGR), where the objective is to train a model with visual samples of gestures from a few ``seen'' classes only and transfer the gained knowledge at test time to recognize semantically-similar unseen gesture classes as well.
- Score: 3.4078654008228924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hand gesture recognition allows humans to interact with machines non-verbally, which has a huge application in underwater exploration using autonomous underwater vehicles. Recently, a new gesture-based language called CADDIAN has been devised for divers, and supervised learning methods have been applied to recognize the gestures with high accuracy. However, such methods fail when they encounter unseen gestures in real time. In this work, we advocate the need for zero-shot underwater gesture recognition (ZSUGR), where the objective is to train a model with visual samples of gestures from a few ``seen'' classes only and transfer the gained knowledge at test time to recognize semantically-similar unseen gesture classes as well. After discussing the problem and dataset-specific challenges, we propose new seen-unseen splits for gesture classes in CADDY dataset. Then, we present a two-stage framework, where a novel transformer learns strong visual gesture cues and feeds them to a conditional generative adversarial network that learns to mimic feature distribution. We use the trained generator as a feature synthesizer for unseen classes, enabling zero-shot learning. Extensive experiments demonstrate that our method outperforms the existing zero-shot techniques. We conclude by providing useful insights into our framework and suggesting directions for future research.
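The feature-synthesis idea at the core of the two-stage framework can be sketched compactly. The PyTorch snippet below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes visual gesture features have already been extracted (in the paper, by the proposed transformer), uses a vanilla conditional GAN objective, and picks illustrative feature/semantic/noise dimensions. It only shows how a generator conditioned on class semantics can be trained on seen classes and then reused as a feature synthesizer for unseen classes.

```python
# Hedged sketch: conditional GAN feature synthesizer for zero-shot gesture recognition.
# Assumed (not from the paper): dimensions, vanilla GAN loss, pre-extracted features.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, SEM_DIM, NOISE_DIM = 512, 300, 100  # illustrative sizes

class Generator(nn.Module):
    """Maps (noise, class semantics) to a synthetic visual feature."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + SEM_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, FEAT_DIM), nn.ReLU())

    def forward(self, noise, sem):
        return self.net(torch.cat([noise, sem], dim=1))

class Discriminator(nn.Module):
    """Scores how plausible a (feature, class semantics) pair is."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + SEM_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1))

    def forward(self, feat, sem):
        return self.net(torch.cat([feat, sem], dim=1))

def train_cgan(real_feats, labels, class_sem, steps=1000, batch=64):
    """real_feats: (N, FEAT_DIM) seen-class gesture features (here assumed given),
    labels: (N,) seen-class indices, class_sem: (C, SEM_DIM) class semantic vectors."""
    G, D = Generator(), Discriminator()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    for _ in range(steps):
        idx = torch.randint(0, real_feats.size(0), (batch,))
        feats, sem = real_feats[idx], class_sem[labels[idx]]
        fake = G(torch.randn(batch, NOISE_DIM), sem)
        # Discriminator step: real (feature, semantics) pairs vs. generated ones.
        d_loss = (F.binary_cross_entropy_with_logits(D(feats, sem), torch.ones(batch, 1)) +
                  F.binary_cross_entropy_with_logits(D(fake.detach(), sem), torch.zeros(batch, 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: produce features the discriminator accepts as real.
        g_loss = F.binary_cross_entropy_with_logits(D(fake, sem), torch.ones(batch, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return G

def synthesize_unseen(G, unseen_sem, per_class=200):
    """Use the trained generator as a feature synthesizer for unseen classes."""
    feats, labels = [], []
    with torch.no_grad():
        for c, sem in enumerate(unseen_sem):
            noise = torch.randn(per_class, NOISE_DIM)
            feats.append(G(noise, sem.expand(per_class, -1)))
            labels.append(torch.full((per_class,), c))
    return torch.cat(feats), torch.cat(labels)
```

The synthesized unseen-class features, optionally mixed with real seen-class features, can then be used to train an ordinary softmax classifier, which is what allows the model to label gestures from classes it never saw during training.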
Related papers
- Adaptive Language-Guided Abstraction from Contrastive Explanations [53.48583372522492]
It is necessary to determine which features of the environment are relevant before determining how these features should be used to compute reward.
End-to-end methods for joint feature and reward learning often yield brittle reward functions that are sensitive to spurious state features.
This paper describes a method named ALGAE, which alternates between using language models to iteratively identify human-meaningful features and learning a reward function defined over those features.
arXiv Detail & Related papers (2024-09-12T16:51:58Z) - Deep self-supervised learning with visualisation for automatic gesture recognition [1.6647755388646919]
Gesture is an important means of non-verbal communication; its visual modality allows humans to convey information during interaction, facilitating both person-to-person and human-machine interactions.
In this work, we explore three different means to recognise hand signs using deep learning: supervised learning based methods, self-supervised methods and visualisation based techniques applied to 3D moving skeleton data.
arXiv Detail & Related papers (2024-06-18T09:44:55Z) - Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z) - Towards Open-World Gesture Recognition [19.019579924491847]
In real-world gesture recognition applications, such as those based on wrist-worn devices, the data distribution may change over time.
We propose the use of continual learning to enable machine learning models to be adaptive to new tasks.
We provide design guidelines to enhance the development of an open-world wrist-worn gesture recognition process.
arXiv Detail & Related papers (2024-01-20T06:45:16Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream.
The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations.
Our experiments leverage 3D virtual environments and they show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z) - SHREC 2021: Track on Skeleton-based Hand Gesture Recognition in the Wild [62.450907796261646]
Recognition of hand gestures can be performed directly from the stream of hand skeletons estimated by software.
Despite the recent advancements in gesture and action recognition from skeletons, it is unclear how well the current state-of-the-art techniques can perform in a real-world scenario.
This paper presents the results of the SHREC 2021: Track on Skeleton-based Hand Gesture Recognition in the Wild contest.
arXiv Detail & Related papers (2021-06-21T10:57:49Z) - MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition [36.74293548921099]
We integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects.
Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition.
arXiv Detail & Related papers (2020-10-12T11:09:44Z) - A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition [5.992264231643021]
We propose an end-to-end prototype-based framework for hand gesture recognition.
The first branch is a prototype-based detector that learns gesture representations.
The second branch is a zero-shot label predictor which takes the features of unseen classes as input and outputs predictions.
arXiv Detail & Related papers (2020-09-29T12:18:35Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)