TENT: Connect Language Models with IoT Sensors for Zero-Shot Activity
Recognition
- URL: http://arxiv.org/abs/2311.08245v1
- Date: Tue, 14 Nov 2023 15:30:17 GMT
- Title: TENT: Connect Language Models with IoT Sensors for Zero-Shot Activity
Recognition
- Authors: Yunjiao Zhou, Jianfei Yang, Han Zou, Lihua Xie
- Abstract summary: This paper explores the feasibility of building an intelligent Human Activity Recognition (HAR) system with human-like cognition.
We propose an innovative approach, IoT-sEnsors-language alignmEnt pre-Training (TENT), which aligns textual embeddings with IoT sensor signals.
We show TENT achieves state-of-the-art performance on zero-shot HAR tasks using different modalities, improving the best vision-language models by over 12%.
- Score: 35.816500811872196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent achievements in language models have showcased their extraordinary
capabilities in bridging visual information with semantic language
understanding. This leads us to a novel question: can language models connect
textual semantics with IoT sensory signals to perform recognition tasks, e.g.,
Human Activity Recognition (HAR)? If so, an intelligent HAR system with
human-like cognition can be built, capable of adapting to new environments and
unseen categories. This paper explores its feasibility with an innovative
approach, IoT-sEnsors-language alignmEnt pre-Training (TENT), which jointly
aligns textual embeddings with IoT sensor signals, including camera video,
LiDAR, and mmWave. Through IoT-language contrastive learning, we derive a
unified semantic feature space that aligns multi-modal features with language
embeddings, so that IoT data corresponds to the specific words that describe
it. To strengthen the connection between textual categories and their
IoT data, we propose supplementary descriptions and learnable prompts that
bring more semantic information into the joint feature space. TENT not only
recognizes actions that have been seen but also "guesses" unseen actions via
the closest textual words in the feature space. We demonstrate that TENT achieves
state-of-the-art performance on zero-shot HAR tasks using different modalities,
improving the best vision-language models by over 12%.
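For intuition, the alignment-and-retrieval idea described in the abstract can be sketched as a CLIP-style contrastive objective followed by nearest-text-embedding classification. The snippet below is a minimal illustrative sketch, not the authors' implementation: the MLP encoder, embedding size, temperature, and helper names (TinyIoTEncoder, contrastive_loss, zero_shot_predict) are assumptions made for clarity.

```python
# Minimal sketch of CLIP-style IoT-language contrastive alignment and
# zero-shot classification by nearest text embedding. NOT the authors' code:
# the encoder, prompt handling, embedding size, and temperature are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyIoTEncoder(nn.Module):
    """Stand-in for a modality-specific encoder (video / LiDAR / mmWave)."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, embed_dim))

    def forward(self, x):
        # Unit-norm embeddings so dot products are cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(iot_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss: pull each IoT sample toward the text
    embedding of its own activity label, push it away from the others."""
    logits = iot_emb @ text_emb.t() / temperature          # (B, B)
    targets = torch.arange(iot_emb.size(0), device=iot_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

@torch.no_grad()
def zero_shot_predict(iot_emb, class_text_emb):
    """Assign each sample to the activity whose text embedding is closest
    (highest cosine similarity), including classes unseen during training."""
    return (iot_emb @ class_text_emb.t()).argmax(dim=-1)
```

In this reading, class_text_emb would come from a frozen language-model text encoder applied to each activity name, optionally augmented with a supplementary description or learnable prompt, so that an unseen activity can still be "guessed" by whichever text embedding lies closest in the shared space.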
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Leveraging Foundation Models for Zero-Shot IoT Sensing [5.319176383069102]
Deep learning models are increasingly deployed on edge Internet of Things (IoT) devices.
Zero-shot learning (ZSL) aims to classify data of unseen classes with the help of semantic information.
In this work, we align the IoT data embeddings with the semantic embeddings generated by an FM's text encoder for zero-shot IoT sensing.
arXiv Detail & Related papers (2024-07-29T11:16:48Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language models to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms previous methods in various zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- SHINE: Syntax-augmented Hierarchical Interactive Encoder for Zero-shot Cross-lingual Information Extraction [47.88887327545667]
In this study, a syntax-augmented hierarchical interactive encoder (SHINE) is proposed to transfer cross-lingual IE knowledge.
SHINE is capable of interactively capturing complementary information between features and contextual information.
Experiments across seven languages on three IE tasks and four benchmarks verify the effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-05-21T08:02:06Z)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following natural language instructions.
Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
- The Internet of Senses: Building on Semantic Communications and Edge Intelligence [67.75406096878321]
The Internet of Senses (IoS) holds the promise of flawless telepresence-style communication for all human receptors.
We elaborate on how the emerging semantic communications and Artificial Intelligence (AI)/Machine Learning (ML) paradigms may satisfy the requirements of IoS use cases.
arXiv Detail & Related papers (2022-12-21T03:37:38Z)
- AttViz: Online exploration of self-attention for transparent neural language modeling [7.574392147428978]
We propose AttViz, an online toolkit for exploration of self-attention: the real values associated with individual text tokens.
We show how existing deep learning pipelines can produce outputs suitable for AttViz, offering novel visualizations of the attention heads and their aggregations with minimal effort, online.
arXiv Detail & Related papers (2020-05-12T12:21:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.