HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
- URL: http://arxiv.org/abs/2508.06475v1
- Date: Fri, 08 Aug 2025 17:25:37 GMT
- Title: HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
- Authors: Guimin Hu, Daniel Hershcovich, Hasti Seifi
- Abstract summary: HapticLLaMA is a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving METEOR and BLEU-4 scores of 59.98 and 32.06, respectively.
- Score: 16.01096757075079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA's captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving METEOR and BLEU-4 scores of 59.98 and 32.06, respectively. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
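To make the tokenization step concrete, below is a minimal sketch of what a frequency-based haptic tokenizer could look like: each short frame of the vibration waveform is reduced to a dominant-frequency bin and a coarse amplitude bin, and the pair is mapped to one discrete unit that could be added to LLaMA's vocabulary. This is an illustrative assumption, not the paper's implementation; the frame length, bin counts, and the `frequency_tokenize` helper are hypothetical.

```python
# Hypothetical sketch of a frequency-based haptic tokenizer (not the paper's code).
# Idea from the abstract: convert a vibration waveform into a sequence of discrete
# units so it can be consumed by a LLaMA-style language model.
import numpy as np

def frequency_tokenize(signal, sr=8000, frame_len=256, hop=128,
                       n_freq_bins=32, n_amp_bins=8):
    """Map a 1-D vibration signal to a list of discrete token IDs.

    Each frame contributes one token encoding (dominant-frequency bin, amplitude bin).
    The frame sizes and bin counts are illustrative, not the paper's values.
    """
    tokens = []
    window = np.hanning(frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        peak = int(np.argmax(spectrum))                    # dominant frequency bin
        freq_hz = peak * sr / frame_len
        freq_bin = min(int(freq_hz / (sr / 2) * n_freq_bins), n_freq_bins - 1)
        amp_bin = min(int(np.log1p(spectrum[peak])), n_amp_bins - 1)  # coarse log amplitude
        tokens.append(freq_bin * n_amp_bins + amp_bin)     # single discrete unit
    return tokens

# Example: a 1-second, 250 Hz vibration burst
sr = 8000
t = np.linspace(0, 1.0, sr, endpoint=False)
vibration = 0.8 * np.sin(2 * np.pi * 250 * t)
ids = frequency_tokenize(vibration, sr=sr)
# These IDs (offset into an extended vocabulary) could be prepended to the text prompt.
print(len(ids), ids[:10])
```

An EnCodec-based tokenizer, by contrast, would pass the waveform through a pretrained neural codec and use its residual vector-quantizer codes as the discrete units; the downstream LoRA fine-tuning and RLHF stages would be unchanged.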
Related papers
- Arabic Sign Language Recognition using Multimodal Approach [0.0]
Arabic Sign Language (ArSL) is an essential communication method for individuals in the Deaf and Hard-of-Hearing community. Existing recognition systems face significant challenges due to their reliance on single-sensor approaches such as Leap Motion or RGB cameras. This paper investigates a multimodal approach that combines Leap Motion and RGB camera data to explore the feasibility of ArSL recognition.
arXiv Detail & Related papers (2026-01-20T09:21:43Z) - E^2-LLM: Bridging Neural Signals and Interpretable Affective Analysis [54.763420895859035]
We present E^2-LLM (EEG-to-Emotion Large Language Model), the first MLLM framework for interpretable emotion analysis from EEG. E^2-LLM integrates a pretrained EEG encoder with Q-based LLMs through learnable projection layers, employing a multi-stage training pipeline. Experiments on the dataset across seven emotion categories demonstrate that E^2-LLM achieves excellent performance on emotion classification.
arXiv Detail & Related papers (2026-01-11T13:21:20Z) - CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding [0.8714814768600078]
We propose a novel cross-subject multimodal BCI decoding framework. It fuses EEG and EMG signals to classify four Mandarin tones under both audible and silent speech conditions. Our findings suggest that tone-level decoding with minimal EEG-EMG channels is feasible and potentially generalizable across subjects.
arXiv Detail & Related papers (2025-11-14T03:50:54Z) - WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities [55.00677513249723]
EEG signals simultaneously encode both cognitive processes and intrinsic neural states. We map EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations.
arXiv Detail & Related papers (2025-09-26T06:21:51Z) - MOSPA: Human Motion Generation Driven by Spatial Audio [56.735282455483954]
We introduce the first comprehensive Spatial Audio-Driven Human Motion dataset, which contains diverse and high-quality spatial audio and motion data. We develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs.
arXiv Detail & Related papers (2025-07-16T06:33:11Z) - HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning [14.038083767470019]
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities. We show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%.
arXiv Detail & Related papers (2025-05-23T09:06:09Z) - MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge [39.014730677559974]
The Mice Autism Detection via Ultrasound Vocalization (MADUV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance.
arXiv Detail & Related papers (2025-01-08T05:32:55Z) - Bridging Auditory Perception and Language Comprehension through MEG-Driven Encoding Models [0.12289361708127873]
We use Magnetoencephalography (MEG) data to analyze brain responses to spoken language stimuli. We develop two distinct encoding models: an audio-to-MEG encoder and a text-to-MEG encoder. Both models successfully predict neural activity, demonstrating significant correlations between estimated and observed MEG signals.
arXiv Detail & Related papers (2024-12-22T19:41:54Z) - Grounding Emotional Descriptions to Electrovibration Haptic Signals [4.551032947977237]
Free-form user language provides rich sensory and emotional information for haptic design.
We developed a computational pipeline to extract sensory and emotional keywords and group them into semantic clusters.
The proposed pipeline demonstrates the viability of a computational approach to analyzing haptic experiences.
arXiv Detail & Related papers (2024-11-04T14:30:57Z) - BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation [48.20672677492805]
Current EEG/MEG-to-text decoding systems suffer from three key limitations. BrainECHO is a multi-stage framework that employs decoupled representation learning. BrainECHO demonstrates robustness across sentence, session, and subject-independent conditions.
arXiv Detail & Related papers (2024-10-19T04:29:03Z) - Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Extracting the Locus of Attention at a Cocktail Party from Single-Trial
EEG using a Joint CNN-LSTM Model [0.1529342790344802]
The human brain performs remarkably well in segregating a particular speaker from interfering speakers in a multi-speaker scenario.
We present a joint convolutional neural network (CNN) - long short-term memory (LSTM) model to infer the auditory attention.
arXiv Detail & Related papers (2021-02-08T01:06:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.